gjeunen / reference_database_creator

creating reference databases for amplicon sequencing
MIT License
25 stars 8 forks source link

Steps 5 onwards online for NCBI databases? #5

Closed laninsky closed 1 year ago

laninsky commented 2 years ago

Kia ora GJ - just checking in about steps 5 onwards - these don't seem to work on the MitoFish database because step 5 doesn't seem to work (generates a blank input.tsv, presumably because the database doesn't follow the structure of the NCBI database?).

gjeunen commented 2 years ago

Hello laninsky,

Just checking that with step 5 you are referring to generating a taxonomic lineage? If so, this should work for the MitoFish database. Could you check if the output from step 4 (in silico PCR) did not result in an empty document? If the output from step 4 is not empty, might you be able to paste the first couple of lines of that document as an attachment to your response? That way I can figure out what is going on.

Cheers, GJ

laninsky commented 2 years ago
crabs insilico_pcr --input complete_partial_mitogenomes.fa --output complete_partial_mitogenomes1.fa --fwd GTCGGTAAAACTCGTGCCAGC --rev CATAGTGGGGTATCTAATCCCAGTTTG --error 4.5

A less on complete_partial_mitogenomes1.fa:

>gb|KY213961|Lates japonicus ([] ["Centropomidae;"])
CACCGCGGTTATACGAGAGGCCCAAGTTGACAAACCACGGCGTAAAGAGTGGTTAAGGAAACTTAAAACTAAAGTCGAACGCTCCCTAGAGCTGTTATACGTGTACGAGAGTACGAAGCTCAACCACGAAAGTGACTTTATAACCCCTGACTCCACGAAAGCTGAGAAA
>gb|KY213962|Lates calcarifer ([] ["Centropomidae;"])
CACCGCGGTTATACGAGAGGCCCAAGCTGACAAACCACGGCGTAAAGAGTGGTTAAGGAAACTTGAAACTAAAGTCGAACGCTCTCTTCGAGCTGTTATACGCGTACGAGACTACGAAGCTCAGCCACGAAAGTGACTTTACAACCCCTGACCCCACGAAAGCTAAGAAA
>gb|KY213963|Lates niloticus ([] ["Centropomidae;"])
CACCGCGGTTATACGAGAGGCCCAAGTTGACAAACCACGGCGTAAAGAGTGGTTAAGGAAATCTGAAACTAAAGCCGAACGCTCCACAGAGCTGTTATACGTGTACGAGAGTACGAAGCTCAGCCACGAAAGTGACTTTATGCCTCCTGATTCCACGAAAGCTGAGAAA
>gb|KY213964|Psammoperca waigiensis ([] ["Centropomidae;"])
CACCGCGGTTATACGAGAGACCCAAGTTGACAAATAACGGCGTAAAGCGTGGTTAAGGAATATACAAAATAAAGTCGAACCCTGTCAAAGCTGTCATACGCGTACGAGAGCATGAAGCCCAAATACGAAAGTAACTTTACTACCCCCGACGCCACGAAAGCCAAGAAA

Code we tried for the taxonomic lineage step:

crabs db_download --source taxonomy
crabs assign_tax --input complete_partial_mitogenomes1.fa --output output.tsv --acc2tax nucl_gb.accession2taxid --taxid nodes.dmp --name names.dmp

nodes.dmp and names.dmp are not empty, in case that helps with the trouble-shooting

Tagging in @bigdog-836 so he is in the loop too

gjeunen commented 2 years ago

I see... The reason why 'crabs assign_tax' code is not working, is because the header information for each sequence is different from the expected crabs format. What should happen when you download the MitoFish database through the crabs db_download function, is that it will format the downloaded fasta file automatically, which is a simple 2-line fasta format with the header as the accession number (please see attached an example of the MitoFish databases downloaded through crabs).

Screen Shot 2022-05-23 at 1 59 26 PM

A couple of reasons why the format might not match the crabs format (other than a bug in the code): 1) sequences were not downloaded through the crabs db_download function 2) the parameter --keep_original in crabs db_download was set to "yes" and the in silico PCR was conducted on this file

Before I dig into the code to find the bug, can you let me know if any of these 2 options apply to your data?

laninsky commented 2 years ago

Oh thanks GJ - yup 100% the second option, sorry! Assumed everything was formatted the correct way, whether the pared down file or the "complete mitogenome" file. We'll try again using the "modified" file, not the "original file" (I originally chose to use the original data because we lost less sequences due to the lack of primers on each end)

gjeunen commented 2 years ago

Great that it is solved! Though the number of sequences that are retained shouldn't be different between the original file and the formatted file. The only thing that crabs will do is to retrieve the accession number from each header, discard all other header info, and place the sequence on a single line. Can you check the number of sequences that are in the original file and the formatted file? If these are different, I'll try to look into it a bit more as to why the sequences were discarded.

laninsky commented 2 years ago

Ah - OK. The apparent difference in the file size was probably just the difference in header lengths. We bricked the original file (writing the output over it) so cannot go back to that easily unfortunately, but I think it just sounds like we were going slightly off-trail! (i.e. not worth your time to troubleshoot).

gjeunen commented 2 years ago

To not having to rerun your steps, you can import the output from step 4 using the db_import module and let crabs know where the accession number can be found. This function will format your file into the crabs format. From there you can run the assign_tax module.

laninsky commented 2 years ago

Thanks GJ! I think we might just run through the steps again - good practice - the nice thing is the mitofish database is quick!