rcs333 / VAPiD

VAPiD: Viral Annotation and Identification Pipeline
MIT License

facing issue with custom database #9

Closed tushar-ahmed closed 4 years ago

tushar-ahmed commented 4 years ago

d3w@d3w:~/VAPiD$ python vapid3.py reference.fasta template.sbt --db blastdb/dbase
metadata not found in provided .csv or .csv not created - time for minimal manual entry for sequence - NC_045512.2
Enter collection date in the format (23-Mar-2005, Mar-2005, or 2005): 2019
Enter country sample was collected in (example - USA): china
Enter strain name - if unknown just put NC_045512.2: NC_045512.2
Enter coverage as a number (example 42.3), if unknown just leave this blank and hit enter:
Searching local blast database at blastdb/dbase
Traceback (most recent call last):
  File "vapid3.py", line 965, in <module>
    meta_list[x], coverage_list[x], sbt_file_loc, full_name_list[x],nuc_acid_type)
  File "vapid3.py", line 630, in annotate_a_virus
    name_of_virus, our_seq, ref_seq, ref_accession, need_to_rc = blast_n_stuff(strain, strain + SLASH + strain + '.fasta')
  File "vapid3.py", line 106, in blast_n_stuff
    ref_seq_gb = line.split('|')[3]
IndexError: list index out of range

I created a database containing only the SARS-CoV-2 reference sequence and tried to annotate against it, but I get the errors shown above.

However, when I use the --r flag with an accession number, it works. What should I do? I need to work with that reference sequence locally due to bandwidth limitations. Please help.

fawaz-dabbaghieh commented 4 years ago

I was facing the same problem. The code hardcodes the assumption that reference names are separated by "|" and that the 4th field, line.split("|")[3], is the accession number. You have two choices: change your database sequence names to fit that pattern, or change line 106 so that it returns the accession number for your naming scheme.
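
For the second choice, here is a minimal sketch of a more tolerant parse. It is not the code in vapid3.py: `extract_accession` is a hypothetical helper, and it assumes `line` holds the reference identifier either in the pipe-delimited form (accession at index 3) or as a bare accession optionally followed by a description. The `gi` value `0` in the example is a placeholder, not a real identifier.

```python
def extract_accession(line):
    """Return the accession from a reference identifier.

    Hypothetical helper, not part of VAPiD. Uses the pipe-delimited
    layout when it is present (accession at index 3) and otherwise
    falls back to the first whitespace-separated token.
    """
    fields = line.strip().split('|')
    if len(fields) > 3 and fields[3]:
        return fields[3]
    # No usable 4th field: assume the identifier starts with the accession.
    tokens = line.split()
    if not tokens:
        raise ValueError('empty reference identifier')
    return tokens[0].lstrip('>')


# Pipe-delimited name (placeholder gi number) and a bare name.
print(extract_accession('gi|0|ref|NC_045512.2| Severe acute respiratory syndrome coronavirus 2'))
print(extract_accession('NC_045512.2 Severe acute respiratory syndrome coronavirus 2'))
```

If this fits your names, line 106 could then read `ref_seq_gb = extract_accession(line)` instead of indexing the split directly. Whether the rest of the script accepts the result is something I have not checked.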

rcs333 commented 4 years ago

Hi! Sorry for missing the first post and thanks for the solution posted above!

You can also use the --f flag to specify the reference file that you would like to use. So in this case you could download NC_045512.2.gbf once and then use the --f flag to always annotate off that file. This completely skips the blast search and is a good option once you know the reference you’re using works well for your files.

How exactly did you create your reference database? Could you upload it? I’d like to take a look and ensure that people can easily make their own reference databases and use them without modifying the code. I’m glad this tool seems like it would be useful and I want to make sure it works for everyone! R

fawaz-dabbaghieh commented 4 years ago

Hey, thanks for the response.

I think the problem is easy to fix. In the database you provide, the reference names are separated by | and the field at index 3 (0-based) is the accession number. If someone builds their own database, though, the names may not follow that pattern. In my case I replaced all the NCBI viral references already in all_viruses with a newer version and added covid. I downloaded viral.1.1.genomic from NCBI, and its names no longer have the | separation, so the code breaks. So the problem is just in the naming.
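
Until the code is changed, the naming mismatch could also be worked around by rewriting the FASTA headers before building the database. The sketch below assumes, without having verified it against VAPiD or BLAST, that a legacy-style `gi|<number>|ref|<accession>|` header is enough to put the accession at index 3. `to_pipe_header`, `rewrite_fasta`, and the placeholder gi value `0` are all hypothetical.

```python
def to_pipe_header(line, gi='0'):
    """Rewrite '>ACCESSION description' as '>gi|0|ref|ACCESSION| description'.

    Sequence lines and headers that already contain '|' are returned
    unchanged. The gi value is a placeholder, not a real identifier.
    """
    if not line.startswith('>') or '|' in line:
        return line
    accession, _, description = line[1:].strip().partition(' ')
    new_header = '>gi|{}|ref|{}|'.format(gi, accession)
    if description:
        new_header += ' ' + description
    return new_header


def rewrite_fasta(in_path, out_path):
    """Copy a FASTA file, converting each header with to_pipe_header."""
    with open(in_path) as src, open(out_path, 'w') as dst:
        for line in src:
            dst.write(to_pipe_header(line.rstrip('\n')) + '\n')


header = to_pipe_header('>NC_045512.2 Severe acute respiratory syndrome coronavirus 2')
print(header)
print(header.split('|')[3])
```

The last line applies the same split that line 106 uses, so you can see which field ends up at index 3 after the rewrite.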

However, I did encounter some minor problems in the code and fixed them. I will try to fix a couple more things and send you a pull request. If you think the fixes make sense, feel free to merge them :)

I'll try to do it in the near future :)