gjeunen / reference_database_creator

creating reference databases for amplicon sequencing
MIT License
21 stars 8 forks source link

How to append in-house barcode in missing_taxa to tax_assign #51

Open timz0605 opened 5 months ago

timz0605 commented 5 months ago

Hello Gert-Jan,

I am working on the same COI metabarcoding project mentioned before, but now encountering some issues with incorporating in-house generated barcodes. The in-house barcode database is generated by previous lab mates and is in qiime2 compatible formats (1 fasta sequence file and 1 taxa file). I am now thinking about how to utilize that database or convert/paste it to the CRABS database I generate. Below are a few questions I have:

1/ For the in-house generated database, some samples have identification to the species level, while some other samples are only identified to family or order level. I wonder how crabs deals with situations like this?

2/ In another discussion https://github.com/gjeunen/reference_database_creator/issues/19, you mentioned that we could try to append the in-house generated barcode to the database AFTER the tax_assign step. I was wondering that, if I were to do this, should I add all the taxonomic information to the entries missing_taxa.tsv file, make sure it follows the same format as in mergedDBinsilicoPGAtax.tsv, before I copy and paste the sequences?

3/ Also, I am a bit confused about the taxID in the example you provided in the documentation, which is accession taxID rank_1,taxID,name rank_2,taxID,name rank_3,taxID,name rank_4,taxID,name rank_5,taxID,name rank_6,taxID,name rank_7,taxID,name sequence. For in-house barcodes, what would be the taxID be?

Again, thank you so much for your time and I greatly appreciate your help!

Below is the first few lines from missing_taxa

head -5 missing_taxa.tsv AF217353 CCTATCCAGGGGCCTCGCCCATGCTGGGGGATCAGTTGACCTCGCCATATTCTCTCTTCACTTGGCTGGTGCTTCTTCCATCTTAGCCTCTATTAAATTTATTACTACAATAATAAAAATGCGAACACCTGGCATGTCCTTCGATCGCCTTCCTTTATTTGTATGGTCCGTTTTCGTCACCGCATTTCTACTTNTACTTTCCCTTCCTGTCCTAGCCGGAGCAATAACAATGCTGCTAACAGACCGGAAAATCAAAACAACTTTTTTTGACCCCGCCGGTGGGGGAGACCCCATTCTATTTCAACATCTCTTC AF335996 TCTATCTAGAGCTATTGCCCATACAGGGGCTTCCGTTGACCTAGCTATTTTTTCTCTACATTTAGCTGGAATTAGTTCAATCCTGGGAGCTGTAAATTTTATCACCACTATTATCAATATACGTTCTACTAGACTAACGTTAGAGCGAATGCCATTATTTGTATGGTCGGTATTAATTACTGCAGTTTTATTACTTTTATCATTACCAGTACTAGCTGGTGCAATTACAATATTATTAACAGACCGAAACATTAATACCTCCTTTTTTGATCCTGCTGGAGGTGGTGACCCTATCCTTTACCAACATTTATTT AF391372 TCTAGCAGGAAACTTGGCCCATGCAGGAGCTTCCGTAGACCTAACAATTTTTTCCCTGCATCTTGCCGGAGTCTCCTCAATTCTAGGGGCCATTAACTTTATTACCACTATTATCAACATGAAGCCGCCCGCCATTTCACAATACCAAACTCCTTTATTCGTTTGGGCAGTCCTGGTCACAGCCGTACTTCTACTTCTCTCCCTTCCAGTGTTGGCTGCCGGTATCACAATGCTCCTCACAGACCGAAACCTAAACACAACCTTTTTTGACCCTGCCGGAGGAGGCGACCCCATTTTGTACCAACACCTATTC AY074899 TCTATCATCTAATATTGCTCATGGAGGAGCTTCTGTTGATTTAGCTATTTTTTCTTTACATTTAGCAGGAATTTCTTCAATTTTAGGGGCAGTAAATTTTATCACTACAGTTATTAATATACGATCAACAGGAATTACATTTGATCGAATACCTCTATTTGTTTGATCAGTAGTAATTACTGCTTTATTACTTTTATTATCTCTTCCAGTATTAGCTGGAGCTATTACTATACTTTTAACAGACCGAAATCTTAATACTTCATTCTTTGACCCTGCTGGAGGAGGAGATCCTGTTTTATACCAACATTTATTT AY137472 CTTATCAAACAATATTGCCCATAATAATATTTCAGTAGATTTAACTATTTTTTCTTTACATATAGCAGGAATTTCATCAATTTTAGGAGCAATTAATTTTATTTGTACTATTATAAATATAATACCAAATAATATAAAATTAAATCAAATTCCACTATTTCCTTGATCAATTCTAATTACAGCTATTTTATTAATTTTATCTTTACCTGTATTAGCTGGTGCTATTACAATATTATTAACAGATCGTAATTTAAATACATCATTTTTTGACCCATCTGGAGGAGGAGATCCAATTTTATATCAACACTTATTT

Below is the example from mergedDBinsilicoPGAtax.tsv

head -5 mergedDBinsilicoPGAtax.tsv U04880 30224 Eukaryota Arthropoda Insecta Lepidoptera Adelidae Adela Adela_trigrapha ATTATCTTCAAATATATTCCATTCCGGAACTTCAGTAGATTTAACAATTTTTTCACTTCATTTAGCAGGAATTTCTTCAATTTTAGGAGCCATTAATTTTATTACAACAGTAATTAATATACGAACAATAAATATAATATTTGATCAAATACCATTATTTGTTTGATCTGTAGCAATTACAGCTTTATTATTATTATTATCATTACCTGTATTAGCAGGAGCTATTACAATATTATTAACAGACCGAAATTTAAATACCTCATTTTTTGATCCTATGGGTGGAGGAGATCCTATTTTATATCAACATTTATTT U04881 30230 Eukaryota Arthropoda Insecta Lepidoptera Cecidosidae Cecidoses Cecidoses_eremita TTTATCTTCTAATATTGCCCACTCAGGAAATTCAGTAGATTTAGCTATCTTTTCATTACATTTAGCCGGTATTTCTTCAATTTTAGGAGCTATAAATTTTATTACAACAGTAATTAACATGCGACCAATTAATATAACACTAAATCAAATACCCTTATTTGTCTGAGCCACTATCATTACAGCATTTTTACTTTTACTATCTTTACCAGTTTTAGCGGGAGCTGTTACAATACTATTAACAGATCGTAATTTAAATACTTCATTTTTTGACCCTTCCGGAGGAGGAGATCCCATTTTATACCAACACTTATTT M93388 6192 Eukaryota Platyhelminthes Trematoda Plagiorchiida Fasciolidae Fasciola Fasciola_hepatica TCTTTCTAGGTTGGATTATTCTAGGTGGGGGGTTGATTTTTTGATGTTTTCTCTTCATTTGGCTGGGGTTTCTAGTCTTTTAGGTTCTATTAAATTTATTTGTACTATTTTAGAGGTTATGTTGGATGAGGGTACGGGTCGGCATAGTATTTTGGTTTGGGCTTATCTGTTTACTTCTGTTTTATTGTTGTTGTCATTGCCTGTTTTGGCTGCTGCTATTACTATGTTGTTATTTGATCGTAAATTTGGCTCTGCTTTTTTTGATCCTATGGGAGGTGGGGATCCTGTTTTATTTCAGCATTTGTTT L14945 7376 Eukaryota Arthropoda Insecta Diptera Calliphoridae Lucilia Lucilia_illustris TTTATCATCTAATATCGCTCATGGAGGAGCTTCTGTTGATTTAGCTATTTTTTCTCTTCATTTAGCAGGAATTTCATCAATTTTAGGAGCTGTAAATTTTATTACTACAGTTATTAATATACGATCAACAGGGATTACTTTTGATCGAATGCCTTTATTCGTTTGATCAGTAGTAATTACAGCTTTATTACTTTTATTATCATTACCAGTATTAGCTGGAGCTATTACTATACTTTTAACTGATCGAAATCTTAATACTTCATTCTTTGACCCAGCAGGAGGAGGAGACCCAATTTTATACCAACATTTATTT L14947 13632 Eukaryota Arthropoda Insecta Diptera Calliphoridae Lucilia Lucilia_sericata TCTATCTTCTAATATTGCTCATGGAGGAGCTTCTGTTGATTTAGCTATTTTCTCTCTTCATTTAGCAGGAATTTCTTCAATTTTAGGAGCTGTAAATTTTATTACTACAGTTATTAATATACGATCAACAGGAATTACTTTTGATCGAATACCTTTATTTGTTTGATCAGTAGTAATTACAGCTTTATTACTTTTATTATCATTACCAGTATTAGCAGGAGCTATTACAATACTTTTAACAGACCGAAATCTTAATACATCATTCTTTGACCCTGCAGGAGGAGGTGATCCAATTTTATACCAACATTTATTT

timz0605 commented 5 months ago

Hello again,

I think I might have figured out the formatting by testing on a small portion of the sequences. Now for dereplication, I am encountering some issues with formatting and delimitation, which might be indicated by the error code:

Command I am using: crabs dereplicate -i merged_taxa.tsv -o merged_taxa_test_derep.tsv -m uniq_species dereplicating merged_taxa.tsv, keeping all unique sequences per species species information found at position 7, starting dereplication process 11%|███████▊ | 27241/243916 [00:00<00:00, 12967544.58it/s] Traceback (most recent call last):

File "/home/mjzhang/reference_database_creator/crabs", line 1462, in main()

File "/home/mjzhang/reference_database_creator/crabs", line 1459, in main args.func(args)

File "/home/mjzhang/reference_database_creator/crabs", line 578, in dereplicate fail, orig_count, derep_count = derep_uniq(INPUT, OUTPUT, RANKS) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "/home/mjzhang/reference_database_creator/function/module_db_cleanup.py", line 96, in derep_uniq spec = lines.split('\t')[species_position]


IndexError: list index out of range

HOWEVER, when I changed the method to `-m strict`, it seems to work
`
strict dereplication of merged_taxa.tsv, only keeping unique sequences
100%|█████████████████████████████████████████████████████████████████████| 243916/243916 [00:00<00:00, 49506791.89it/s]
found 627 sequences in merged_taxa.tsv
written 512 sequences to merged_taxa_test_derep.tsv
`

I was wondering what is the issue here?

The example file is attached: [merged_taxa.txt](https://github.com/gjeunen/reference_database_creator/files/14330204/merged_taxa.txt)
gjeunen commented 5 months ago

Hello @timz0605,

Quite a few questions in the posts above. I'll try my best to answer them, but please let me know if I've overlooked anything or if something isn't clear.

1) For in-house barcodes that contain an NCBI accession number, CRABS can use the accession number to process the sequences. For in-house barcodes without an NCBI accession number, CRABS can use the species name that should be within the sequence. CRABS will generate a unique identifier that will act as an accession number and uses the species name information to find the NCBI taxonomic ID. In case the species is not listed on NCBI, you will need to manually enter the necessary information into the names.dmp and nodes.dmp files that are downloaded by crabs db_download -s taxonomy.

2) Indeed, the easiest way forward for in-house barcodes, is to generate a CRABS reference database with the online reference sequences. Then, once you have your final ref DB, just append your own barcodes to the document and use this as the reference database for taxonomy assignment.

3) The taxID is the NCBI taxonomic ID, which you can find here: https://www.ncbi.nlm.nih.gov/taxonomy. All taxonomic names, species and otherwise, will have a unique identifier that can be looked up on the website. Alternatively, you can search through the names.dmp file to find the taxonomic ID. This file is a simple tab-delimited text file and can be opened with a text editor such as Sublime Text or TextEdit.

4) The error you are encountering is due to the fact that not all sequences have information up to species level. I'm assuming those are your in-house barcodes that are causing the problem? If your in-house barcodes are assigned to, let's say, family level, you will need to provide CRABS with the taxID's associated with the undefined genus and species taxonomic ID's for your family. You can look up a sequence on NCBI that is deposited as Family X sp. to determine how they have resolved the genus and species taxonomic ID. In case those undefined taxonomic IDs do not exist on NCBI, please see answer 1 on how to proceed. The reason why -m strict is executing properly, is because it does not take taxonomic information into account, solely the sequence. However, this method is not recommended if you expect some species might contain the same sequence, i.e., the amplicon does not resolve all genera to species level.

Best, Gert-Jan

timz0605 commented 5 months ago

Hello Gert-Jan,

Thank you so much! This is really helpful information! I will spend some time digging into the details and playing around with the codes.

Best,