maximilianh / multiSub

Prepares a SARS-CoV-2 submission for GISAID, NCBI or ENA. Can read GISAID or NCBI files, or plain fasta+tsv/csv/xls. Finds files in input directory and merges everything into a single output directory. Auto-detects input file formats. Can submit the results to multiple repositories from the command line.
GNU General Public License v3.0

Can you provide submission of pathogenic sequences other than COVID-19 (e.g influenza virus)? #5

Open virologist opened 2 years ago

virologist commented 2 years ago

Hi, @maximilianh

It's a useful tool for submitting viral sequences in bulk. I was wondering whether it would be possible to support submission of pathogen sequences other than SARS-CoV-2 (e.g. influenza virus)?

Best, Yang

maximilianh commented 2 years ago

Yes, you can change all the values via the config file. Do you have an example sequence and can you tell me where you want to submit it to?

Do you want one single submission with multiple species or mostly just one?

Do you think you'll switch between organisms or will you always submit just one organism?

virologist commented 2 years ago

Hi Maximilian

Thanks for your prompt response.

Can you tell me where you want to submit it to?

  • It depends on what the journal requires, usually either NCBI or GISAID. For my part, I'd like to submit to the GISAID EpiFlu database.

Do you want one single submission with multiple species or mostly just one?

  • Usually we submit multiple viral sequences (one species/subtype) as a batch.

Do you think you'll switch between organisms or will you always submit just one organism?

  • I mostly work on influenza virus. Occasionally we need to submit sequences from other viruses, which can only be deposited in NCBI.

Best wishes, Yang

maximilianh commented 2 years ago

Do you have a test sequence in your format, and can you tell me which virus it is? Then I can try a test submission and send you a sample config for it. The options are all there, I’ve just never used them myself.

virologist commented 2 years ago

Here is the HA gene of an H6N2 avian influenza virus that we have already submitted to GISAID. In case you need the other information that GISAID requires for a submission, here is some more metadata:

  • Sample location: Poyang Lake, Jiangxi, China
  • Sample date: 2/1/2018
  • Sample source: wild bird fecal
  • Host: Eurasian Teal

>A/Eurasian_Teal/Jiangxi/2018WB0049/2018(H6N2)
TTGGCAGCAGCCGGGAAGTCAGACAAGATCTGCATTGGATATCATGCCAACAACTCAACAACACAAGTGGATACTATCCTTGAGAAAAATGTCACCGTCACGCACTCAGTTGAATTGCTAGAAACCCAGAAGGAGGAGAGATTCTGCAACATCCTGAACAAGGGCCCTCTCGACCTAAAGGGATGCACCATAGAGGGTTGGATACTGGGGAATCCCCAATGCGACCTGTTGCTTGGTGATCAAAGCTGGTCATATATAGTGGAAAGACCTAGTGCTCAAAATGGGATTTGCTACCCAGGAACCTTGAACGACGCAGAAGAACTTAAGGCACTCATTGAATCAGGAGAAAGAGTAGAGAGATTTGAGATGTTTCCCAAAAGCACATGGGCAGGAGTTGACACCAGCAGTGGGGTGACAAAGGCTTGCCCCTATATTAGTGGTTCATCTTTCTATAGAAATCTCTTATGGATAATAAAGACCAAGTCAGCAGCATACCCAGTGATCAAAGGGACTTACAACAACACTGGAAACCAGCCAATCCTTTATTTCTGGGGTGTGCACCATCCTCCAGACACCAATGAACAAAATACTCTGTATGGCTCTGGTGATAGATACGTTAGGATGGGAACTGAAAGCATGAACTTCGCCAAGAGTCCAGAAATTGCAGCAAGACCTGCTGTGAACGGTCAAAGAGGCAGAATTGATTATTACTGGTCTGTTTTAAAACCAGGTGAAACCTTGAATGTGGAATCTAATGGAAATCTAATTGCCCCTTGGTATGCATACAAATTTGTCAGCACAAATAATAAGGGAGCCATCTTCAAGTCAAGTTTACCAATCGAGAACTGTGATGCCACATGCCAGACTATTGCAGGGGTCCTAAGAACCAATAAAACATTTCAGAATGTAAGTCCTCTGTGGATAGGAGAATGCCCCAAATATGTGAAAAGTGAAAGTTTGAGGCTTGCAACTGGACTGAGGAACGTTCCACAGATTGGAACTAGAGGTCTTTTTGGGGCCATAGCAGGATTTATTGAAGGAGGATGGACTGGAATGATAGATGGGTGGTATGGCTATCACCATGAGAATTCCCAGGGGTCAGGATATGCAGCAGACAAAGAGAGCACTCAAAGGGCTATAGACGGAATTACAAATAAAGTCAATTCCATCATTGATAAAATGAACACACAATTTGAAGCTGTTGACCACGAATTCTCAAATATAGAGAGAAGAATTGACAATCTGAACAAAAGGATGGAAGATGGATTCCTAGATGTTTGGACATACAATGCTGAACTGCTGGTTCTTCTTGAAAACGAAAGGACACTAGACCTGCACGATGCAAATGTAAAGAACCTATATGAGAAGGTCAAATCGCAATTAAGGGACAATGCTAATGATCTGGGAAATGGGTGCTTTGAATTCTGGCATAAGTGTGACAATGAGTGTATGGAATCTGTTAAGAATGGTACTTATGATTATCCCAAGTACCAGGACGAGAGCAAATTGAACAGGCAGGAAATAGAATCGGTAAAGCTAGAAAATCTTGGTGTGTATCAAATCCTTGCTATTTATAGTACGGTATCGAGCAGTCTGGTGTTGGTAGGGCTGATCATAGCAATGGGTCTTTGGATGTGTTCAAATGGTTCAA

By the way, because the influenza genome is segmented, we usually need to submit the other gene segments at the same time. So here is the NA gene of this strain as well.

>A/Eurasian_Teal/Jiangxi/2018WB0049/2018(H6N2)
TCTGTCTCTCTAACCATTGCAACAGTATGTTTCCTCATGCAAATTGCCATCCTAGCGACAACTATAACACTGCACTTCAAGCAGAATGAATGCAGCATTCCCTCGAACAATCAAGTAGTGCCATGTGAGCCAATCATAGTAGAAAGGAACATAACAGAGATAGTGTATTTGAACAACACCACCATAGAAAAAGAACTTTGTCCTAAATTGACAGAATACAGGGATTGGTTGAAACCACAGTGTCAGATCACAGGATTTGCTCCTTTCTCCAAGGACAACTCAATCCGGCTTTCTGCTGGTGGGGACATTTGGGTAACAAGGGAACCTTATGTATCATGCAGTCCCAATAAGTGTTATCAGTTCGCACTTGGGCAGGGAACCACGCTGGACAACAAACATTCAAACGGCACAATACATGATAGGATTCCCCATCGGACCCTTTTGATGAACGAGTTGGGTGTTCCGTTTCATTTAGGGACCAAACAAGTGTGCATAGCATGGTCCAGCTCAAGCTGCCATGATGGAAGAGCATGGCTTCACGTTTGTGTTACTGGGGATGATAGGAATGCAACCGCCAGTTTCATTTATAATGGGGTGCTTGTTGACAGCATTGGTTCATGGTCCCAAAACATTCTCAGAACTCAGGAGTCAGAATGCGTCTGCATCAATGGAACTTGTACAGTAGTAATGACTGATGGAAGTGCATCAGGAAGGGCTGATACTAGAATACTATTCATTAAAGAAGGGAAAATTGTTCATATCAGCCCATTATCAGGAAGTGCCCAGCATATAGAGGAGTGTTCCTGTTATCCCCGCTATCCAGACGTCAGATGTGTCTGCAGAGACAATTGGAAAGGTTCAAATAGGCCCGTTATAGATATAAATATGGCAGATTATAGCATTGATTCTAGTTATGTGTGCTCAGGGCTTGTTGGAGACACACCGAGAAACGATGATAGCTCTAGCAATAGTAACTGCAAGGATCCTAATAATGAGAGAGGGAACCCAGGAGTGAAAGGGTGGGCATTTGACTATGGAAATGATGTTTGGATGGGAAGAACAATCAGCAAGGATTCTCGCTCAGGTTATGAGACCTTCAGAGTCATTGGCGGTTGGACAACAGCTAATTCCAAATCTCAAGTAAATAGACAAGTCATAGTTGACAATAATAACTGGTCTGGTTATTCTGGCATCTTCTCTGTTGAAGGCAAAAGCTGCATCAATAGGTGTTTTTATGTGGAGTTGATAAGGGGAAGGCCACAAGAGACTAGAGTATGGTGGACTTCAAACAGTATTGTCGTGTTTTGTGGAACTTCAGGTACTTATGGGACAGGCTCATGGCCTGATGGGGCGAATATTAATT

Thank you very much for your help!

Best, Yang

maximilianh commented 2 years ago

Great. So you want this submitted to NCBI or GISAID?

I would edit your meta data to put it into csv or tsv format and save the sequence to fasta, right? So two rows for the meta file and two sequences for the fasta.

virologist commented 2 years ago

Great. So you want this submitted to NCBI or GISAID?

Yes, I think GISAID is the priority.

I would edit your meta data to put it into csv or tsv format and save the sequence to fasta, right?

right

So two rows for the meta file and two sequences for the fasta.

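For GISAID, there is an official guideline (https://www.gisaid.org/epiflu-applications/submitting-data-to-epiflutm/) for batch upload, and another protocol (https://www.protocols.io/view/sars-cov2-gisaid-submission-protocol-kqdg35oy1v25/v3?step=1.2) for uploading multiple samples (batch upload).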

maximilianh commented 2 years ago

Yes, sorry, I know how the GISAID side works, I was asking how you wanted to provide the files to multiSub. So a meta file with two rows and one fasta file with two sequences.
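For illustration, the two input files described here (one meta file with two rows, one fasta file with two sequences) could be generated with a short Python sketch like the one below. The column names, the `_HA`/`_NA` suffixes used to keep the two sequence names distinct, and the `write_inputs` helper are all assumptions made for this example, not names taken from multiSub; the sequences are truncated to their first 20 bases to keep the example short.

```python
import csv
from pathlib import Path

# Hypothetical column names, chosen for illustration only.
FIELDS = ["name", "location", "date", "source", "host"]

# Metadata as given earlier in this thread (date kept in the submitter's format).
SHARED = {
    "location": "Poyang Lake, Jiangxi, China",
    "date": "2/1/2018",
    "source": "wild bird fecal",
    "host": "Eurasian Teal",
}

STRAIN = "A/Eurasian_Teal/Jiangxi/2018WB0049/2018(H6N2)"

# Assumption: a segment suffix keeps the two names unique.
# Sequences are truncated to the first 20 bases of each segment.
RECORDS = [
    {"name": STRAIN + "_HA", "sequence": "TTGGCAGCAGCCGGGAAGTC", **SHARED},
    {"name": STRAIN + "_NA", "sequence": "TCTGTCTCTCTAACCATTGC", **SHARED},
]


def write_inputs(records, out_dir):
    """Write seq.fa and meta.tsv: one FASTA record and one table row per entry."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    fasta_path = out_dir / "seq.fa"
    meta_path = out_dir / "meta.tsv"

    with open(fasta_path, "w") as fa:
        for rec in records:
            fa.write(f">{rec['name']}\n{rec['sequence']}\n")

    with open(meta_path, "w", newline="") as fh:
        # extrasaction="ignore" drops the "sequence" key, which belongs in the FASTA file.
        writer = csv.DictWriter(
            fh, fieldnames=FIELDS, delimiter="\t", extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(records)

    return fasta_path, meta_path
```

Calling `write_inputs(RECORDS, "influenza_input")` would create both files in a directory of that (arbitrary) name; the real column headers should be checked against the sample files shipped with the tool.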

maximilianh commented 2 years ago

Hi @biopig, do you know how to check out the "main" branch? I just committed a test case for this.

If you add this to your ~/.multisub/config (see also config.sample):

organism = "Influenza A virus"
longOrg = "Influenza A virus"
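If you would rather script this step, here is a minimal Python sketch that appends the two settings only when they are not already present. It assumes the config is a plain-text file of `key = "value"` lines, as the snippet above suggests; the `ensure_settings` helper is hypothetical and not part of multiSub. It does not change a key that is already set to a different value.

```python
from pathlib import Path


def ensure_settings(config_path, settings):
    """Append `key = "value"` lines for keys not yet set; return the added lines."""
    path = Path(config_path).expanduser()
    text = path.read_text() if path.exists() else ""

    # Collect keys that already appear on non-comment lines.
    present = {
        line.split("=", 1)[0].strip()
        for line in text.splitlines()
        if "=" in line and not line.lstrip().startswith("#")
    }
    missing = [
        f'{key} = "{value}"' for key, value in settings.items() if key not in present
    ]

    if missing:
        path.parent.mkdir(parents=True, exist_ok=True)
        # Start on a fresh line if the existing file lacks a trailing newline.
        prefix = "" if text == "" or text.endswith("\n") else "\n"
        with open(path, "a") as fh:
            fh.write(prefix + "\n".join(missing) + "\n")
    return missing
```

For example, `ensure_settings("~/.multisub/config", {"organism": "Influenza A virus", "longOrg": "Influenza A virus"})` would add both lines on the first call and nothing on the second.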

And then go to this directory:

https://github.com/maximilianh/multiSub/tree/main/tests/biopig

And then run this command:

../../multiSub conv seq.fa meta.tsv out

Then an NCBI submission file like this gets created in out/ncbiSeqAndSource.fa:

>A/Eurasian_Teal/Jiangxi/2018WB0049/2018(H6N2) [isolate=Influenza A virus/Eurasian Teal/USA/2018WB0049/2018] [country=China: Poyang Lake, Jiangxi] [collection_date=2018-2-1] [host=Eurasian Teal] [organism=Influenza A virus] Influenza A virusisolate Influenza A virus/Eurasian Teal/USA/2018WB0049/2018, complete genome
TTGGCAGCAGCCGGGAAGTCAGACAAGATCTGCATTGGATATCATGCCAACAACTCAACA
ACACAAGTGGATACTATCCTTGAGAAAAATGTCACCGTCACGCACTCAGTTGAATTGCTA
GAAACCCAGAAGGAGGAGAGATTCTGCAACATCCTGAACAAGGGCCCTCTCGACCTAAA...
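To check the generated header before uploading, the bracketed `[key=value]` source modifiers can be pulled out with a small Python sketch such as the following. The `parse_header` helper and the regular expression are illustrative assumptions, not part of multiSub.

```python
import re

# Matches one [key=value] modifier; keys and values may contain spaces.
MODIFIER = re.compile(r"\[([^=\]]+)=([^\]]*)\]")


def parse_header(header):
    """Return (sequence id, dict of [key=value] modifiers) for one FASTA header line."""
    header = header.lstrip(">")
    seq_id = header.split(None, 1)[0]
    modifiers = {key.strip(): value.strip() for key, value in MODIFIER.findall(header)}
    return seq_id, modifiers


EXAMPLE = (
    ">A/Eurasian_Teal/Jiangxi/2018WB0049/2018(H6N2) "
    "[isolate=Influenza A virus/Eurasian Teal/USA/2018WB0049/2018] "
    "[country=China: Poyang Lake, Jiangxi] [collection_date=2018-2-1] "
    "[host=Eurasian Teal] [organism=Influenza A virus]"
)

seq_id, mods = parse_header(EXAMPLE)
for key, value in mods.items():
    print(f"{key}: {value}")
```

In the sample header, the isolate modifier contains "USA" while the country modifier says China, so comparing the parsed fields against meta.tsv before uploading may be worthwhile.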

As for GISAID, I cannot download the GISAID template file for Influenza today. I can't even go to the GISAID Influenza website (epicov) today, the link at https://www.epicov.org/epi3/frontend doesn't work. If you have the csv or Excel template from their site for me, I can fix up the GISAID uploader.

virologist commented 2 years ago

Hi, @maximilianh

Fantastic, let me give it a try. Here is the GISAID batch upload template: gisaid_batch_uploader.xls (https://github.com/maximilianh/multiSub/files/8600423/gisaid_batch_uploader.xls)

Thanks, Yang

maximilianh commented 2 years ago

Hi Biopig, oh darn, the GISAID flu template is totally different from the Covid one. Are you sure you need GISAID upload? If you have table files already in GISAID format, why use multiSub at all?
