Clarify fasta header naming with uc to biom

dridk commented 8 years ago

I am trying to run a simple test, but I don't understand how FASTA headers are processed. For example, I have a single-sample file, test.fa, with the following reads:

>A_sample1
AGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACA
>A_sample2
ATGGTCGTATATATATGGTCGTATATATATGGTCGTATATATATGGTCGTATATATATGGTCGTATATATATGGTCGTATATAT
>A_sample3
ATGGTCGTGTCGTGTCGTGTCGTATATATATCGGTCGTGTCGTGTCGTGTCGTGTCGTATGTCGTGTCGTGTCGTGTCGTATAT
>A_sample4
ATGGTCGTGTCGTGTCGTGTCGTATATATATCGGTCGTGTCGTGTCGTGTCGTGTCGTATGTCGTGTCGTGTCGTGTCGTATAT
>A_sample5
ATACGTGTATGATATGCGGTGTAATACGTGTATGATATGCGGTGTAATACGTGTATGATATGCGGTGTAATACGTGTATGATAT
>A_sample6
ATACGTGTATGATATGCGGTGTAATACGTGTATGATATGCGGTGTAATACGTGTATGATATGCGGTGTAATACGTGTATGATAT
>A_sample7
AGAACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACA
>A_sample8
AGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACA
>A_sample9
AGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACA
>A_sample10
ATGGTCGTGTCGTGTCGTGTCGTATATATATCGGTCGTGTCGTGTCGTGTCGTGTCGTATGTCGTGTCGTGTCGTGTCGTATAT
>A_sample11
ATGGTCGTGTCGTGTCGTGTCGTATATATATCGGTCGTGTCGTGTCGTGTCGTGTCGTATGTCGTGTCGTGTCGTGTCGTATAT
>A_sample12
ATGGTCGTGTCGTGTCGTGTCGTATATATATCGGTCGTGTCGTGTCGTGTCGTGTCGTATGTCGTGTCGTGTCGT

I cluster them using:

vsearch --cluster_fast test.fa --id 0.97 --centroids centroids.fa --sizeout --uc test.uc --relabel_sha1 --relabel_keep

Now I want to convert the result to biom using your script:

python create_otu_table_from_uc_file.py -i test.uc -o test.biom

I get the following error:

Error in uc file formating. Check for spaces in sample IDs and to make sure there is a semicolon after sample IDs.
First line with issue:
S       0       84      *       *       *       *       *       A1      *
100.0%
Writing table...

I think the FASTA headers have to follow some rule, but I don't know which one. Could you give me a simple example to help me understand? Thanks.

leffj commented 8 years ago

Hi, good question. You need a string in the FASTA header that includes ';barcodelabel=SAMPLEID;'. For example:

>M01918:213:000000000-AFC1C:1:1101:15775:1331 1:N:0:0;barcode=TAAATATACCCT;barcodelabel=cp83;
TACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTATGTAAGACAGGTGTGAAATCCCCGGGCTTAACCTGGGAATTGCCTTTGGGACTGCATGGCTAGAGTGTGTCAGAGGGGGGTAGAATTCCAAGTGTAGCAGTGTAATGCGTAGATATGTGGGGGAATACCGATGGCGGAGGCAGCCCCCTGGGCAGATACTGACGCTCAGGCACGAAAGCCTGGGGAGCAAACA

where ‘cp83’ is the sample ID.

This formatting comes from the prep_fastq_for_uparse_paired.py script, fyi.
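
For headers that were not produced by that script, something like the following might work as a starting point. It is only a sketch: the function names `add_barcodelabel` and `sample_id_from_label` and the regular expression are assumptions made for illustration, not code taken from create_otu_table_from_uc_file.py, and the sample ID is passed in by hand because plain headers such as `>A_sample1` do not say which sample a read belongs to.

```python
import re

# Assumed pattern, based only on the ';barcodelabel=SAMPLEID;' convention
# described above (not copied from the actual script).
BARCODELABEL = re.compile(r";barcodelabel=([^;\s]+);")


def sample_id_from_label(label):
    """Return the sample ID embedded in a read label, or None if absent."""
    match = BARCODELABEL.search(label)
    return match.group(1) if match else None


def add_barcodelabel(fasta_lines, sample_id):
    """Yield FASTA lines, appending ';barcodelabel=<sample_id>;' to headers.

    Headers that already carry a barcodelabel are left unchanged.
    """
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">") and sample_id_from_label(line) is None:
            yield f"{line};barcodelabel={sample_id};"
        else:
            yield line


# Hypothetical usage: tag every read in a one-sample file with sample ID "A".
reads = [">A_sample1", "AGATACAAGATACA", ">A_sample2", "ATGGTCGTATATAT"]
for out_line in add_barcodelabel(reads, "A"):
    print(out_line)
```

A label such as `A1` from the error message above contains no `;barcodelabel=...;` part, so `sample_id_from_label("A1")` returns `None`, which is consistent with the formatting error being reported.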

Jon


bioinfo17 commented 5 years ago

Hi,

I have a .uc file in the format below:

H 205 339 98.5 + 0 0 339M B3::M02542:85:000000000-BWJ73:1:1102:22965:2274 OTU_206
H 547 339 98.5 + 0 0 339M B13::M02542:85:000000000-BWJ73:1:2116:22473:4007 OTU_548
H 436 339 97.6 + 0 0 D338M B14::M02542:85:000000000-BWJ73:1:1116:19896:20825 OTU_437
H 127 339 98.8 + 0 0 339M B9::M02542:85:000000000-BWJ73:1:1118:22070:17406 OTU_128
H 200 337 99.1 + 0 0 I337M B3::M02542:85:000000000-BWJ73:1:1116:13763:3215 OTU_201
H 174 339 98.8 + 0 0 339M B15::M02542:85:000000000-BWJ73:1:1115:12758:8719 OTU_175
N * * * . * * * B6::M02542:85:000000000-BWJ73:1:1117:9645:18835 *
H 137 328 99.1 + 0 0 328M11I B12::M02542:85:000000000-BWJ73:1:2103:20919:8080 OTU_138
H 443 335 100.0 + 0 0 335M4I B12::M02542:85:000000000-BWJ73:1:1103:27262:12348 OTU_444

I get the following error:

Error in uc file formating. Check for spaces in sample IDs and to make sure there is a semicolon after sample IDs.
First line with issue:
H 349 338 99.4 + 0 0 261MI77M B1::M02542:85:000000000-BWJ73:1:1OTU_35022:9749 1:N:0:TAGCTT

I'm finding it hard to convert the .uc file to an OTU table text file. Would you please be able to modify the script create_otu_table_from_uc_file.py to accommodate user-specific needs such as this?

Any help will be much appreciated, thanks in advance.
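
For labels of this shape, one possible workaround is to build the table directly from the uc records. The sketch below is not the logic of create_otu_table_from_uc_file.py; it assumes that the text before `::` in each query label is the sample ID, that the file is tab-delimited with the usual 10 uc fields, and that only `H` (hit) records should be counted. The names `otu_counts_from_uc` and `write_table` are made up for this example.

```python
import io
from collections import Counter, defaultdict


def otu_counts_from_uc(uc_lines, sep="::"):
    """Count reads per OTU and sample from tab-delimited uc records.

    Assumes query labels look like 'SAMPLE::READ_ID' (hypothetical layout).
    """
    counts = defaultdict(Counter)
    for line in uc_lines:
        fields = line.rstrip("\n").split("\t")
        # Field 1 is the record type, field 9 the query label,
        # field 10 the target (OTU) label.
        if len(fields) != 10 or fields[0] != "H":
            continue
        sample, found, _read_id = fields[8].partition(sep)
        if not found:
            continue  # label does not follow the assumed layout
        counts[fields[9]][sample] += 1
    return counts


def write_table(counts, out):
    """Write a tab-separated OTU table (OTUs as rows, samples as columns)."""
    samples = sorted({s for per_otu in counts.values() for s in per_otu})
    out.write("#OTU ID\t" + "\t".join(samples) + "\n")
    for otu in sorted(counts):
        row = [str(counts[otu][s]) for s in samples]
        out.write(otu + "\t" + "\t".join(row) + "\n")


# Hypothetical usage with shortened read IDs.
example = [
    "H\t205\t339\t98.5\t+\t0\t0\t339M\tB3::read1\tOTU_206\n",
    "N\t*\t*\t*\t.\t*\t*\t*\tB6::read2\t*\n",
    "H\t137\t328\t99.1\t+\t0\t0\t328M11I\tB12::read3\tOTU_138\n",
]
buffer = io.StringIO()
write_table(otu_counts_from_uc(example), buffer)
print(buffer.getvalue(), end="")
```

Splitting on tabs rather than on any whitespace keeps a label that contains a space, such as the one in the error message, inside a single field.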

leffj / helper-code-for-uparse

Clarify fasta header naming with uc to biom #3