rcs333 / VAPiD

VAPiD: Viral Annotation and Identification Pipeline
MIT License
50 stars 15 forks source link

VAPiD not processing final sequence (of 5) and no sequences yield .sqn files #10

Closed mihinduk closed 3 years ago

mihinduk commented 3 years ago

Hi, I have a set of 5 novel picorna viruses that I am trying to annotate for submission to GenBank. I am attaching my fasta, template and metadata files. (NOTE: I had to add .txt to the end of the template and metadata files to attach them here)

Here is the command I used: python3 vapid3.py /mnt/pathogen1/kathiem/simian_picornaviruses/novel_Picornaviridae.fasta /mnt/pathogen1/kathiem/simian_picornaviruses/template.sbt --metadata_loc /mnt/pathogen1/kathiem/simian_picornaviruses/novel_Picornaviridae_metadata.csv --spell_check --all

When the process starts, I get this message (5 times): "Automatic strain naming failed but submission will proceed without metadata appended to the fasta header."

For the first 4 sequences, all finish with this message: "/bin/sh: tbl2asn: command not found Done with: NOLA1"

These are the files present in the NOLA1 folder: assembly.cmt NOLA1.ali NOLA1_aligner.fasta NOLA1.blastresults NOLA1.fasta NOLA1.fsa NOLA1_ref.fasta NOLA1_ref.gbk NOLA1.tbl

The final sequence crashes with this message: "Searching local blast database at all_virus.fasta AF326754.2 was the selected reference Simian enterovirus SV19 was the parsed name of the virus Aligning reference and query... Done alignment Traceback (most recent call last): File "vapid3.py", line 965, in meta_list[x], coverage_list[x], sbt_file_loc, full_name_list[x],nuc_acid_type) File "vapid3.py", line 643, in annotate_a_virus gene_loc_list, gene_product_list, all_loc_list, all_product_list, name_of_the_feature_list = pull_correct_annotations(strain, our_seq, ref_seq, genome) File "vapid3.py", line 385, in pull_correct_annotations all_loc_list[entry][y] = adjust(int(all_loc_list[entry][y]), our_seq_num_array, ref_seq_num_array, genome) File "vapid3.py", line 300, in adjust if our_num_array[given_num] == '-1': IndexError: list index out of range"

Can you please tell me where my error lies? I have already replaced \r\n with \n in my files.
Your advice would be greatly appreciated,

Kathie novel_Picornaviridae_metadata.csv

novel_Picornaviridae.fasta.txt template.sbt.txt

rcs333 commented 3 years ago

Hi Kathie!

I'm sorry this isn't working out of the box for you. Thanks for the extremely detailed information and attached files :D

The first four sequences are not producing .sqn files because VAPiD can't find tbl2asn, the NCBI program I use to convert the created .fsa and .tbl files into a GenBank compatible .sqn file. Pleae make sure that this is correctly installed and you can run it from your shell. Let me know if you need additional help with this, I'm very happy to help get my program working for people!

I'm not quite sure why the last file is crashing but I will do some troubleshooting with the files you attached and get back to you on this.

Just as a word of warning this program was designed and tested for well characterized human viruses and works best for these use cases. That said, I've used VAPiD to annotate novel viruses many many times and I'm pretty confident we can get it working for your use case here, sometimes it just requires a little mucking around with the code to get everything packaged up and looking pretty.

I'll update you with what I find after I do a bit of troubleshooting later tonight. Hopefully we can get this working without too much of a hassle!

Ryan

rcs333 commented 3 years ago

Okay so I ran your sequences and I think I should explain in a little more detail how VAPiD works and some of the assumptions that it makes.

As the name would suggest, my program is not particularly smart - basically all it does is blastn search the NCBI viral database and then copy and paste the annotations from the best hit onto your submission. This makes it great for the use case I designed it for, annotating and submitting 500 RSV genomes that are all >95% identical to each other and the best blastn hit already in NCBI.

I did some quick blastn on the sequences you uploaded and the % nucleotide identity to the best hit for these sequences is in the low 80s. In my experience vapid doesn't perform super well in these cases and it often ends up being much faster if I just write the .tbl file by hand and then run tbl2asn myself lol

It looks like NOLA3 and NOLA1 completed and look okay except we name submissions based on what the top blast hit is named so idk if you really want these guys to be called "Enterovirus J"

NOLA3.gbf.txt NOLA3.sqn.txt NOLA1.gbf.txt NOLA1.sqn.txt

The others aren't working because the ORFs on the "references" aren't lining up with the actual open reading frames of the sequences you're trying to annotate. This can happen for a number of reasons. Although I don't have access to good analysis software right now, it looks like in the example of NOLA5 the genome you're trying to annotate doesn't fully contain the poly protein ORF that the "reference" has. This makes my program whack out because one of the key assumptions vapid makes is that the sequence you're trying to annotate has complete ORFs. Honestly I've been trying to find a graceful solution to this problem for a while now and I always end up just giving up and fixing it by hand.

So in that vein I manually wrote out some tbl files and stuffed them through tbl2asn and have attached them here. (again these are all called "simian enterovirus SV19" because that's what the top blast hit was called.

NOLA2.gbf.txt NOLA2.sqn.txt NOLA4.gbf.txt NOLA4.sqn.txt NOLA5.gbf.txt NOLA5.sqn.txt

I'm sorry my program isn't super helpful for your use case although I hope the attached files are at least some help. I can change the names really easily and re-send them to you if that's all you need :)

Anyways, please let me know if this was helpful and if I can assist in any other ways. I love viral discovery and I'm always happy to answer questions or help out with annotation work!

Ryan

mihinduk commented 3 years ago

Hi Ryan, I fixed the problem with tbl2asn. I just needed to remove the platform from the name.

I now have these files: assembly.cmt errorsummary.val NOLA1.ali NOLA1_aligner.fasta NOLA1.blastresults NOLA1.fasta NOLA1.fsa NOLA1.gbf NOLA1_ref.fasta NOLA1_ref.gbk NOLA1.sqn NOLA1.tbl NOLA1.val

I tried setting up a job with just NOLA5 and received the same set of errors.

Thank you for your help, Kathie

mihinduk commented 3 years ago

Hi Ryan,

Thank you very much for your help. I can use these as a backbone, make some manual adjustments and submit them. Most of the viruses we work isolate will be from human hosts, so I will definitely use your tool for them.

It would be really helpful if you had an example list of expected output files, since I wasn’t sure whether all the files were present or not.

I appreciate all your help and quick response, Kathie

From: Ryan C Shean @.> Reply-To: rcs333/VAPiD @.> Date: Friday, May 21, 2021 at 12:47 AM To: rcs333/VAPiD @.> Cc: "Mihindukulasuriya, Kathie" @.>, Author @.***> Subject: Re: [rcs333/VAPiD] VAPiD not processing final sequence (of 5) and no sequences yield .sqn files (#10)

Okay so I ran your sequences and I think I should explain in a little more detail how VAPiD works and some of the assumptions that it makes.

As the name would suggest, my program is not particularly smart - basically all it does is blastn search the NCBI viral database and then copy and paste the annotations from the best hit onto your submission. This makes it great for the use case I designed it for, annotating and submitting 500 RSV genomes that are all >95% identical to each other and the best blastn hit already in NCBI.

I did some quick blastn on the sequences you uploaded and the % nucleotide identity to the best hit for these sequences is in the low 80s. In my experience vapid doesn't perform super well in these cases and it often ends up being much faster if I just write the .tbl file by hand and then run tbl2asn myself lol

It looks like NOLA3 and NOLA1 completed and look okay except we name submissions based on what the top blast hit is named so idk if you really want these guys to be called "Enterovirus J"

NOLA3.gbf.txthttps://github.com/rcs333/VAPiD/files/6520225/NOLA3.gbf.txt NOLA3.sqn.txthttps://github.com/rcs333/VAPiD/files/6520226/NOLA3.sqn.txt NOLA1.gbf.txthttps://github.com/rcs333/VAPiD/files/6520105/NOLA1.gbf.txt NOLA1.sqn.txthttps://github.com/rcs333/VAPiD/files/6520107/NOLA1.sqn.txt

The others aren't working because the ORFs on the "references" aren't lining up with the actual open reading frames of the sequences you're trying to annotate. This can happen for a number of reasons. Although I don't have access to good analysis software right now, it looks like in the example of NOLA5 the genome you're trying to annotate doesn't fully contain the poly protein ORF that the "reference" has. This makes my program whack out because one of the key assumptions vapid makes is that the sequence you're trying to annotate has complete ORFs. Honestly I've been trying to find a graceful solution to this problem for a while now and I always end up just giving up and fixing it by hand.

So in that vein I manually wrote out some tbl files and stuffed them through tbl2asn and have attached them here. (again these are all called "simian enterovirus SV19" because that's what the top blast hit was called.

NOLA2.gbf.txthttps://github.com/rcs333/VAPiD/files/6520284/NOLA2.gbf.txt NOLA2.sqn.txthttps://github.com/rcs333/VAPiD/files/6520285/NOLA2.sqn.txt NOLA4.gbf.txthttps://github.com/rcs333/VAPiD/files/6520288/NOLA4.gbf.txt NOLA4.sqn.txthttps://github.com/rcs333/VAPiD/files/6520290/NOLA4.sqn.txt NOLA5.gbf.txthttps://github.com/rcs333/VAPiD/files/6520291/NOLA5.gbf.txt NOLA5.sqn.txthttps://github.com/rcs333/VAPiD/files/6520292/NOLA5.sqn.txt

I'm sorry my program isn't super helpful for your use case although I hope the attached files are at least some help. I can change the names really easily and re-send them to you if that's all you need :)

Anyways, please let me know if this was helpful and if I can assist in any other ways. I love viral discovery and I'm always happy to answer questions or help out with annotation work!

Ryan

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/rcs333/VAPiD/issues/10#issuecomment-845671668, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ANDVLDLL2424VOGBIOIN2M3TOXXVZANCNFSM45IDTPRQ.


The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail.

rcs333 commented 3 years ago

Okay great! Thanks for the feedback, I have updated the readme with a screenshot of what the expected output should be with the included example files.

I'm marking this issue as closed but don't hesitate to reach out if you have any more suggestions or questions about the program.

Happy annotating! Ryan