rdpstaff / AlignmentTools

Tools for pairwise sequence comparison, distance calculation, and hidden markov model sequence scoring (using HMMER3 models). Many RDP Projects require this package.
GNU General Public License v3.0

AlignmentTools.jar pairwise-knn output #1

Open sheikki opened 8 years ago

sheikki commented 8 years ago

I'm classifying representative sequences of quality-controlled and clustered 16S reads with the command:

java -jar AlignmentTools.jar pairwise-knn query.fq db.fa

The db file is an unaligned prokaryotic subset of RDP 11.4, clustered at 99% (with some sequence length thresholds).

Is this a sensible way to assign taxonomy to my representative sequences?

In the output, I see lines like:

@650A9:00200:00424 1 + 155 1.000 0 34 34 0 83 S004055894 Listeria monocytogenes; CA5 Lineage=Root;rootrank;Bacteria;domain;Firmicutes;phylum;Bacilli;class;Bacillales;order;Listeriaceae;family;Listeria;genus

As far as I can tell it's QID KNEIGHBOURS STRAND SCORE %ID QSTART QEND QEND QSTART SSTART SID. Is this the correct interpretation? Why is it that the QSTART and QEND values are displayed twice?

rdpstaffmsu commented 8 years ago

Hi, sheikki,

The column definitions are in the header of the output file:

seqname k orientation score ident query_start query_end query_length ref_start ref_end ref_seqid ref_desc

Is "@650A9:00200:00424" a sequence of length 34? If so, this assignment might be your best bet, but it is too short to be reliable.
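To make the mapping concrete, here is a minimal Python sketch that splits one result line into the twelve header columns. The names `KnnHit`, `parse_hit` and `SAMPLE` are hypothetical and are not part of AlignmentTools. It assumes the fields are whitespace-separated and that only the last column (ref_desc) contains spaces, as in the example line from this thread.

```python
from typing import NamedTuple


class KnnHit(NamedTuple):
    """One pairwise-knn result line, named after the header columns."""
    seqname: str
    k: int
    orientation: str
    score: float
    ident: float
    query_start: int
    query_end: int
    query_length: int
    ref_start: int
    ref_end: int
    ref_seqid: str
    ref_desc: str


def parse_hit(line: str) -> KnnHit:
    # Split on runs of whitespace at most 11 times, so the final
    # column (ref_desc) keeps its internal spaces.
    f = line.rstrip("\n").split(None, 11)
    if len(f) != 12:
        raise ValueError(f"expected 12 columns, got {len(f)}")
    return KnnHit(
        f[0], int(f[1]), f[2], float(f[3]), float(f[4]),
        int(f[5]), int(f[6]), int(f[7]), int(f[8]), int(f[9]),
        f[10], f[11],
    )


# Example line quoted in this thread.
SAMPLE = (
    "@650A9:00200:00424 1 + 155 1.000 0 34 34 0 83 S004055894 "
    "Listeria monocytogenes; CA5 Lineage=Root;rootrank;Bacteria;domain;"
    "Firmicutes;phylum;Bacilli;class;Bacillales;order;Listeriaceae;family;"
    "Listeria;genus"
)

hit = parse_hit(SAMPLE)
print(hit.query_start, hit.query_end, hit.query_length)
print(hit.ref_start, hit.ref_end, hit.ref_seqid)
```

Read this way, the second "34" in the example line is query_length, and the "0" that follows it is ref_start, not a repeat of the query coordinates.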

Benli Chai

RDP Staff



sheikki commented 8 years ago

Thank you for the reply. Oddly, in my alignment file, the ref_start value is always zero. A few examples:

@650A9:00007:00316  1   -   265 0.940   0   72  72  0   427 S001099040  Bacillus subtilis; XN-80-5  Lineage=Root;rootrank;Bacteria;domain;Firmicutes;phylum;Bacilli;class;Bacillales;order;Bacillaceae 1;family;Bacillus;genus
>-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------TGAGCAACATCTTGCACGGTACTGACT-ACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATAC----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>CACGTGGGTAACCTGCCTGTAAGACTGGGATAACTCCGGGAAACCGGGGCTAATACCGGATGGTTGTTTGAACCGCATGGTTCAGACATAAAAGGTGGCTTCGGCTACCACTTACAGATGGACCCGCGGCGCATTAGCTAGTTGGTGAGGTAACGGCTCACCAAGGCAACGATGCGTAGCCGACCTGAGAGGGTGATCGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAGGGAATCTTCCGCAATGGACGAAAGTCTGACGGAGCAACGCCGCGTGAGTGATGAAGGTTTTCGGATCGTAAAGCTCTGTTGTTAGGTAAGAACAAGTGCCGTTCAAATA-GGGCGGCACCTTG-ACGGTAC---CTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCTCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCCCGGCTCAACCGGGGAGGGTCATTGGAAACTGGGGAACTTGAGTGCAGAAGAGGAGAGTGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGACTCTCTGGTCTGTAACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAGTGCTAAGTGTTAGGGGGTTTCCGCCCCTTAGTGCTGCAGCTAACGCATTAAGCACTCCGCCTGGGGAGTACGGTCGCAAGACTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCTCTGACAATCCTAAGAGATAGGACGTCCCCTTCGGGGCAAGGTGACAGGTGGTGGCATTAGGAAGACAAGTCGTTCAATAAGCGGCACTTGACGGTACTACCAGAAAGGCCACGCTAACTACGTGCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTGTCGGAATATTGGGCGTAAAGGGCTCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCCCGGCTCAACCGGGGAGGGTCATTGGAAACTGGGGAACTTGAGTGCAGAAGAGGAGAGTGGAATTTCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGACTCTCTGGTCTGTAACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAGTGCTAAGTGTTAGGGGGTTTCCGCCCCTTAGTGCTGCAGCTAACGCATTAAGCACTCCGCCTGGGGAGTACGGTCGCAAGACTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCTCTGACAATCCTAGAGATAGGACGTCCCCTTCGGGGGCAGAGTGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGATCTTAGTTGCCAGCATTCAGTTGGGCACTCTAAGGTGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTGCTACAATGGGCAGAACAAAGGGCAGCGAAACCGCGAGGTTAAGCCAATCCCACAAATCTGTTCTCAGTTCGGATCGCAGTCTGCAACTCGACTGCGTGAAGCTGGAATCGTTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACACACCG
@650A9:00009:00308  1   -   449 1.000   0   102 102 0   515 S003301453  Bacillus cereus; B16    Lineage=Root;rootrank;Bacteria;domain;Firmicutes;phylum;Bacilli;class;Bacillales;order;Bacillaceae 1;family;Bacillus;genus
>---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ACTCTGGTTGTTAGGG-AGAACAAGTAGCTAG-T-AATAGCTGGCACCTTGACGGTACCTAA-CAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATAC-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>TATTTGGGCGGGGGGGGGCCTATCATGCAGTCGAGCGAATGGATTAAGAGCTTGCTCTTATGAAGTTATCGGCGGACGGGTGAGTAACACGTGGGTAACCTGCCCATAAGACTGGGATAACTCCGGGAAACCGGGGCTCTAATACCGGATAACATTTTGAACCGCATGGTTCGAAATTGAAAGGCGGCTTCGGCTGTCACTTATGGATGGACCCGCGTCGCATTAACTAGTTGGTGAGGTAACGGCTCACCAAGGCAACGATGCGTAGGCGACCTGAGAGGGTGATCGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAGGGAATCTTCCGCAATGGACGAAAGTCTGACGGAGCAACGCCGCGTGAGTGATGAAGGCTTTCGGGTCGTAAAACTCT-GTTGTTAGGGAAGAACAAGT-GCTAGTTGAATAGCTGGCACCTTGACGGTACCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGCAGGTGGTTTCTTAAGTCTGATGTGGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGAGACTTGAGTGCAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGAGATATGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGTCTGTAACTGACACTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAGTGCTAAGTGTTAGAGGGTTTCCGCCCTTTAGTGCTGAAGTTAACGCATTAAGCACTCCGCCTGGGGAGTACGGCCGCAAGGCTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTAATTCGAAGCAACGCGAAGAACCCTACCAGGTCTTGACATCCTCTGAAACCCTAGAGATAGGGCTTCTCCTTCGGGAGCAGAGTGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGTTAAGTCCGCAACGAGCGCAACCCTTGATCTTAGTTGCCATCATTAAGTTGGGCACTCTAAGGTGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTGCTACAATGGACGGTACAAAGAGCTGCAAGACCGCGAGGTGGAGCTATTCTCATAAAACCGTTCTCAGTTCGGATTGTAGGCTGCAACTCGCCTACATGAAGCTGGAATCGCTAGTAATCGCGGATCAGGTTACCGCGGTGAATACGTTCCCGGGCCTTGTACACACCTCCCGTCACACCACGAGAGTTTGTAACACCCGAAGTCGGTGGGGTAACCTTTTGGGAGCCAGCCGGCCTAAAGGGGGAGAAAG
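To check whether ref_start is zero for every record and not just these examples, the column can be tallied across the whole file. This is a minimal sketch, not part of AlignmentTools. The function name `ref_start_counts` is hypothetical. It assumes that alignment rows start with ">", as in the examples above, and that ref_start is the ninth whitespace-separated field. The alignment rows in the demo input are abbreviated placeholders.

```python
from collections import Counter
from typing import Iterable


def ref_start_counts(lines: Iterable[str]) -> Counter:
    """Count how often each ref_start value occurs among the record lines."""
    counts: Counter = Counter()
    for line in lines:
        # Assumption: rows starting with '>' are alignment strings, not records.
        if not line.strip() or line.startswith(">"):
            continue
        fields = line.split(None, 11)
        if len(fields) < 11:
            continue
        try:
            counts[int(fields[8])] += 1  # ninth column: ref_start
        except ValueError:
            # Not a record line (for example, the column header); skip it.
            continue
    return counts


# Record lines from the examples above; alignment rows shortened for the demo.
demo = [
    "seqname k orientation score ident query_start query_end query_length "
    "ref_start ref_end ref_seqid ref_desc",
    "@650A9:00007:00316\t1\t-\t265\t0.940\t0\t72\t72\t0\t427\tS001099040\t"
    "Bacillus subtilis; XN-80-5\tLineage=Root;rootrank;Bacteria;domain",
    ">----TGAGCAACATC----",
    ">CACGTGGGTAACCTGC",
    "@650A9:00009:00308\t1\t-\t449\t1.000\t0\t102\t102\t0\t515\tS003301453\t"
    "Bacillus cereus; B16\tLineage=Root;rootrank;Bacteria;domain",
    ">----ACTCTGGTTGTT----",
    ">TATTTGGGCGGGGGGG",
]

counts = ref_start_counts(demo)
print(dict(counts))
```

On a real output file, the same function can be called as `ref_start_counts(open(path))`, where `path` is the name of the pairwise-knn output file.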