qunfengdong / BLCA


How do I train the classifier with my own database of fasta files? #2

Closed TurbulentCupcake closed 5 years ago

TurbulentCupcake commented 7 years ago

I tried the following: python BLCA/1.subset_db_acc.py -d 'training_BLCA.fa', where training_BLCA.fa is my custom training set. But when I run this command, it connects to the NCBI database.

Can you help me out with this?

Thanks

yingeddi2008 commented 7 years ago

Hi,

Thanks for using our software. Per your request, I have updated the README page on GitHub. The 1.subset_db_acc.py script is only used for formatting the NCBI 16S database, so it cannot be used to build your own database.

To format your own database, please follow the instructions on the updated README page on GitHub. You need a FASTA file and a taxonomy file in a specific format. No script is needed unless you need help formatting the taxonomy file.

Please feel free to contact me if there is any issue,

Huaiying (Eddi) Lin


TurbulentCupcake commented 7 years ago

How are you making the taxonomy file in step 2?

yingeddi2008 commented 7 years ago

Hi Adithya,

Since you are making your own database, it is assumed that the taxonomy of every sequence in your database is known. Say you want to include E. coli in your database: you can find the FASTA sequence here: https://www.ncbi.nlm.nih.gov/nuccore/NR_024570.1, and its taxonomy here: https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=562. Please make sure the format is the same as in https://github.com/qunfengdong/BLCA#training-your-own-database.

I hope this helps. Please let me know if there are any more questions,

Huaiying
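As a rough illustration of the step described above, the sketch below turns a known lineage into one taxonomy line. The helper name `format_taxonomy_line`, the rank order, and the exact layout (ID, a tab, then `rank:name` pairs joined by semicolons) are assumptions modeled on the `rank:name` pairs used in this thread; the README remains the authority on the format, including whether a trailing semicolon is expected. The E. coli lineage names are illustrative and should be checked against the NCBI Taxonomy page linked above.

```python
# Hypothetical helper, not part of BLCA: builds one taxonomy line from a
# lineage that you already know for a sequence.

# Assumed rank order, most specific first.
RANK_ORDER = ["species", "genus", "family", "order", "class", "phylum", "superkingdom"]


def format_taxonomy_line(seq_id, lineage):
    """Return 'seq_id<TAB>rank:name;rank:name;...' for the ranks present in lineage."""
    pairs = [f"{rank}:{lineage[rank]}" for rank in RANK_ORDER if rank in lineage]
    return seq_id + "\t" + ";".join(pairs)


# Illustrative lineage for the E. coli record mentioned above.
ecoli = {
    "species": "Escherichia coli",
    "genus": "Escherichia",
    "family": "Enterobacteriaceae",
    "order": "Enterobacterales",
    "class": "Gammaproteobacteria",
    "phylum": "Proteobacteria",
    "superkingdom": "Bacteria",
}

print(format_taxonomy_line("NR_024570.1", ecoli))
```

The sequence ID used in the taxonomy line should be the same ID that appears in the FASTA header for that sequence.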


TurbulentCupcake commented 7 years ago

Hi,

I'm trying to classify some sequences to their taxonomy, but every one of them comes back in the output file like this:

AJ000684|S000004347     Unclassified

The FASTA file used for training has the following format:

>AJ000684|S000004347    Root;Bacteria;"Actinobacteria";Actinobacteria;Actinobacteridae;Actinomycetales;Corynebacterineae;Mycobacteriaceae;Mycobacterium
gaacgctggcggcgtgcttaacacatgcaagtcgaacggaaaggtctcttcggagatactcgagtggcgaacgggtgagtaacacgtgggtaatctgccctgcacatcgggataagcctgggaaactgggtctaataccgaataggacctcgaggcgcatgccttgtggtggaaagcttttgcggtgtgggatgggcccgcggcctatcagcttgttggtggggtgacggcctaccaaggcgacgacgggtagccggcctgagagggtgtccggccacactgggactgagatacggcccagactcctacgggaggcagcagtggggaatattgcacaatgggcgcaagcctgatgcagcgacgccgcgtgggggatgacggncttcgggttgtaaacctctttcagcagggacgaagcgcaagtgacggtacctgcagaagaagcaccggccaactacgtgccagcagccgcggtaatacgtagggtgcgagcgttgtccggaattactgggcgtaaagagctcgtaggtggtttgtcgcgttgttcgtgaaaaccgggggcttaaccctcggcgtgcgggcgatacgggcagactggagtactgcaggggagactggaattcctggtgtagcggtggaatgcgcagatatcaggaggaacaccggtggcgaaggcgggtctctgggcagtaactgacgctgaggagcgaaagcgtggggagcgaacaggattagataccctggtagtccacgccgtaaacggtgggtactaggtgtgggtttccttccttgggatccgtgccgtagctaacgcattaagtaccccgcctggggagtacggccgcaaggctaaaactcaaaggaattgacgggggcccgcacaagcggcggagcatgtggattaattcgatgcaacgcgaagaaccttacctgggtttgacatgcacaggacgccggcagagatgtcggttcccttgtggcctgtgtgcaggtggtgcatggctgtcgtcagctcgtgtcgtgagatgttgggttaagtcccgcaacgagcgcaacccttgtctcatgttgccagcgggtaatgccggggactcgtgagagactgccggggtcaactcggaggaaggtggggatgacgtcaagtcatcatgccccttatgtccagggcttcacacatgctacaatggccggtacaaagggctgcgatgccgcaaggttaagcgaatccttttaaagccggtctcagttcggatcggggtctgcaactcgaccccgtgaagtcggagtcgctagtaatcgcagatcagcaacgctgcggtgaatacgttcccgggccttgtacacaccgcccgtcacgtcatgaaagtcggtaacacccgaagccagtggcctaacctttgggagggagctgtcgaaggtgggatcggcgattgggacgaagtcgt

The taxonomy file for the same training set looks as follows:

AJ000684|S000004347     genus:Mycobacterium;family:Mycobacteriaceae;suborder:Corynebacterineae;order:Actinomycetales;subclass:Actinobacteridae;class:Actinobacteria;phylum:"Actinobacteria";domain:Bacteria

I also tried a FASTA file that matches the format described in the README, leaving the taxonomy out of the headers, as follows:

>AJ000684|S000004347
gaacgctggcggcgtgcttaacacatgcaagtcgaacggaaaggtctcttcggagatactcgagtggcgaacgggtgagtaacacgtgggtaatctgccctgcacatcgggataagcctgggaaactgggtctaataccgaataggacctcgaggcgcatgccttgtggtggaaagcttttgcggtgtgggatgggcccgcggcctatcagcttgttggtggggtgacggcctaccaaggcgacgacgggtagccggcctgagagggtgtccggccacactgggactgagatacggcccagactcctacgggaggcagcagtggggaatattgcacaatgggcgcaagcctgatgcagcgacgccgcgtgggggatgacggncttcgggttgtaaacctctttcagcagggacgaagcgcaagtgacggtacctgcagaagaagcaccggccaactacgtgccagcagccgcggtaatacgtagggtgcgagcgttgtccggaattactgggcgtaaagagctcgtaggtggtttgtcgcgttgttcgtgaaaaccgggggcttaaccctcggcgtgcgggcgatacgggcagactggagtactgcaggggagactggaattcctggtgtagcggtggaatgcgcagatatcaggaggaacaccggtggcgaaggcgggtctctgggcagtaactgacgctgaggagcgaaagcgtggggagcgaacaggattagataccctggtagtccacgccgtaaacggtgggtactaggtgtgggtttccttccttgggatccgtgccgtagctaacgcattaagtaccccgcctggggagtacggccgcaaggctaaaactcaaaggaattgacgggggcccgcacaagcggcggagcatgtggattaattcgatgcaacgcgaagaaccttacctgggtttgacatgcacaggacgccggcagagatgtcggttcccttgtggcctgtgtgcaggtggtgcatggctgtcgtcagctcgtgtcgtgagatgttgggttaagtcccgcaacgagcgcaacccttgtctcatgttgccagcgggtaatgccggggactcgtgagagactgccggggtcaactcggaggaaggtggggatgacgtcaagtcatcatgccccttatgtccagggcttcacacatgctacaatggccggtacaaagggctgcgatgccgcaaggttaagcgaatccttttaaagccggtctcagttcggatcggggtctgcaactcgaccccgtgaagtcggagtcgctagtaatcgcagatcagcaacgctgcggtgaatacgttcccgggccttgtacacaccgcccgtcacgtcatgaaagtcggtaacacccgaagccagtggcctaacctttgggagggagctgtcgaaggtgggatcggcgattgggacgaagtcgt

I ran into the same problem.

Can you help me out with this? Could it have something to do with how the files are formatted?

Thanks
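One formatting check that can be done before training is to confirm that every sequence ID in the database FASTA also appears in the taxonomy file, and vice versa. The sketch below assumes the two files are linked by the first whitespace-delimited token on each header or taxonomy line; that linkage, and the function names, are assumptions for illustration rather than a description of what BLCA does internally.

```python
def fasta_ids(lines):
    """Collect sequence IDs: the first whitespace-delimited token after '>'."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def taxonomy_ids(lines):
    """Collect the first whitespace-delimited token of each non-empty line."""
    ids = set()
    for line in lines:
        if line.strip():
            ids.add(line.split()[0])
    return ids


def compare_ids(fasta_lines, taxonomy_lines):
    """Return (IDs only in the FASTA, IDs only in the taxonomy file), both sorted."""
    in_fasta = fasta_ids(fasta_lines)
    in_taxonomy = taxonomy_ids(taxonomy_lines)
    return sorted(in_fasta - in_taxonomy), sorted(in_taxonomy - in_fasta)


# Shortened stand-ins for the files quoted above.
fasta = [">AJ000684|S000004347    Root;Bacteria;Mycobacterium", "gaacgctggcgg"]
taxonomy = ["AJ000684|S000004347     genus:Mycobacterium;domain:Bacteria"]
print(compare_ids(fasta, taxonomy))
```

To run it on real files, pass open file handles instead of the lists, for example `compare_ids(open("db.fasta"), open("db.taxonomy"))`, where both filenames are placeholders.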

yingeddi2008 commented 7 years ago

Hi Adithya,

Thanks for using our software. The problem could be that you have domain instead of superkingdom in the taxonomy file. Could you do a simple substitution and let me know how it turns out?

Huaiying
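The substitution suggested above can be scripted. This is a minimal sketch, assuming the taxonomy file uses `rank:name` pairs separated by semicolons as in the line quoted earlier; `rename_rank` and `rename_rank_in_file` are hypothetical helper names, and the file paths passed to the second one are up to you.

```python
import re


def rename_rank(line, old="domain", new="superkingdom"):
    """Rename a rank label only where it begins a 'rank:name' pair."""
    # The label must be at line start or follow a tab, space, or semicolon,
    # so a longer label that merely ends in 'domain' is left untouched.
    return re.sub(rf"(^|[\t ;]){re.escape(old)}:", rf"\1{new}:", line)


def rename_rank_in_file(src_path, dst_path, old="domain", new="superkingdom"):
    """Write a copy of the taxonomy file with the rank label renamed."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(rename_rank(line, old, new))


example = 'AJ000684|S000004347\tgenus:Mycobacterium;phylum:"Actinobacteria";domain:Bacteria'
print(rename_rank(example))
```

Writing to a new file rather than editing in place keeps the original taxonomy file available for comparison.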


TurbulentCupcake commented 7 years ago

I had the same issue after changing domain to superkingdom. I noticed a levels variable in your 2.blca_main.py script that lists the ranks. I tried modifying this variable to include only the ranks present in my taxonomy file, but the result was the same. Is there a workaround?
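To see how the rank labels in a taxonomy file compare with a list of expected ranks, a small check along these lines may help. `ASSUMED_LEVELS` is only a stand-in for the levels variable mentioned above and is an assumption; copy the actual list from your copy of 2.blca_main.py before relying on the result. The function names are hypothetical.

```python
from collections import Counter

# ASSUMPTION: placeholder for the levels list in 2.blca_main.py.
ASSUMED_LEVELS = [
    "superkingdom", "phylum", "class", "order",
    "family", "genus", "species", "subspecies",
]


def ranks_in_taxonomy(lines):
    """Count how often each rank label occurs across all taxonomy lines."""
    counts = Counter()
    for line in lines:
        parts = line.strip().split(None, 1)  # sequence ID, then the lineage
        if len(parts) < 2:
            continue
        for pair in parts[1].split(";"):
            rank, sep, _name = pair.partition(":")
            if sep:
                counts[rank.strip()] += 1
    return counts


def unexpected_ranks(lines, levels=ASSUMED_LEVELS):
    """Return rank labels that occur in the file but not in the expected list."""
    return sorted(set(ranks_in_taxonomy(lines)) - set(levels))


taxonomy = [
    'AJ000684|S000004347     genus:Mycobacterium;family:Mycobacteriaceae;'
    'suborder:Corynebacterineae;order:Actinomycetales;subclass:Actinobacteridae;'
    'class:Actinobacteria;phylum:"Actinobacteria";domain:Bacteria'
]
print(unexpected_ranks(taxonomy))
```

For the line quoted earlier in this thread, the labels flagged against the assumed list are domain, subclass, and suborder.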

yingeddi2008 commented 7 years ago

Hi,

Could you please send a test FASTA file, a taxonomy file, and a database FASTA file that reproduce this output?

Thanks,

Huaiying (Eddi) Lin


TurbulentCupcake commented 7 years ago

Can I get your email?

I've sent you the files at the following email: ying.eddi2008@gmail.com

Thanks

yingeddi2008 commented 7 years ago

ying.eddi2008@gmail.com or hlin2@luc.edu, please.

Huaiying (Eddi) Lin


TurbulentCupcake commented 7 years ago

Hi,

I wanted to follow up on our discussion. Is there a parameter that will allow me to directly filter confidence at each rank?

Thanks

yingeddi2008 commented 5 years ago

We do not have a parameter that can filter confidence at each rank.
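Since there is no built-in parameter, one possible workaround is to filter the output file afterwards. The sketch below is not part of BLCA. It assumes each output line has the form `seqID<TAB>rank:name;score;rank:name;score;...`, which this thread does not confirm, so check it against your own output first. The function name and the thresholds are hypothetical.

```python
def filter_by_confidence(line, thresholds, default=0.0):
    """Truncate an assignment at the first rank whose score is below its threshold.

    thresholds maps a rank label to its minimum score; ranks that are not
    listed fall back to default.
    """
    seq_id, _, rest = line.rstrip("\n").partition("\t")
    fields = [field for field in rest.split(";") if field]
    kept = []
    # ASSUMPTION: fields alternate as 'rank:name', score, 'rank:name', score, ...
    for label, score in zip(fields[0::2], fields[1::2]):
        rank = label.split(":", 1)[0]
        if float(score) < thresholds.get(rank, default):
            break  # drop this rank and every rank listed after it
        kept.append((label, float(score)))
    return seq_id, kept


line = "read1\tsuperkingdom:Bacteria;100.0;phylum:Actinobacteria;95.5;genus:Mycobacterium;60.0;"
print(filter_by_confidence(line, {"genus": 80.0}))
```

Stopping at the first rank that fails its threshold is a design choice that assumes ranks are listed from broadest to most specific; a line that carries no scores, such as an Unclassified result, comes back with an empty list.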