torognes / vsearch

Versatile open-source tool for microbiome analysis

usearch_global command eats my sample IDs #558

Closed marchoeppner closed 2 months ago

marchoeppner commented 2 months ago

Hi,

I am recreating a pipeline/workflow I found a little while ago that uses vsearch. It includes an interesting "trick": getting the OTU counts per sample for all samples at once via usearch_global. However, I have noticed that usearch_global mangles my sample IDs if they contain a dash/hyphen, maybe because the developers never anticipated the way I am doing this.

The basic logic goes as follows:

(Trimming and primer-site removal are done outside of vsearch.)

For each library:

vsearch --fastq_merge $fwd --reverse $rev \\
    --fastqout $merged \\
    --threads ${task.cpus} \\
    --fastq_eeout \\
    -relabel ${meta.sample_id}. 

Note that I am attaching the sample ID to each read header in the fastq file via "-relabel".

vsearch -fastq_filter $fq \\
    -fastq_maxee_rate 0.1 \\
    -relabel Filtered \\
    -threads ${task.cpus} \\
    -fastaout $filtered 

And then for all samples combined:

-> Dereplicate -> Cluster Unoise -> Uchime3 Denovo -> Cluster size
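
For readers who want to see these combined steps as commands, here is a rough dry-run sketch. It only prints the vsearch calls rather than running them. The file names, the 0.97 identity threshold and the OTU_ prefix (guessed from the OTU labels shown in this issue) are assumptions for illustration, not the settings of the original workflow.

```shell
#!/bin/sh
# Dry-run sketch: prints one vsearch command per pooled step.
# All file names and thresholds below are hypothetical placeholders.
otu_pipeline_cmds() {
    all=$1   # combined, filtered reads from all samples (FASTA)
    cat <<EOF
vsearch --derep_fulllength $all --sizeout --output derep.fasta
vsearch --cluster_unoise derep.fasta --sizein --sizeout --centroids denoised.fasta
vsearch --uchime3_denovo denoised.fasta --nonchimeras nochim.fasta
vsearch --cluster_size nochim.fasta --id 0.97 --sizein --relabel OTU_ --centroids otus.fasta
EOF
}

# Print the four commands for an assumed input file.
otu_pipeline_cmds all_filtered.fasta
```

Printing the commands first makes it easy to review the option names against the vsearch manual before running anything on real data.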

I then proceed to quantify my samples against the OTU set as follows:

vsearch --usearch_global $fastq \\
    -threads ${task.cpus} \\
    -db $db \\
    -otutabout $tabbed

where $fastq is the combined set of all reads as emitted by the individual -fastq_merge steps.

And this is where it goes wrong, because --usearch_global clips my sample IDs:

Fastq header after "-fastq_merge": @MS-A1.1;ee=0.03493

--usearch_global turns this into "MS", i.e. it drops everything from the "-" onward.

Since I have many samples with similar names (MS-A1, MS-A2, etc) I am ending up with only a single count column "MS".

#OTU ID   MS
OTU_1     188100
OTU_10    31877
OTU_100   24
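
The clipping seen here is consistent with the sample name being taken from the leading run of letters and digits in the read header when no explicit sample annotation is present. That rule is an assumption inferred from the output in this issue and should be checked against the vsearch manual. The sketch below only mimics it; guess_sample is a hypothetical helper, not part of vsearch, and the second header is made up for illustration.

```shell
#!/bin/sh
# Assumption: the sample name is the leading alphanumeric run of the header.
# guess_sample is a hypothetical helper that imitates that rule.
guess_sample() {
    # Drop the FASTQ "@" prefix, then cut at the first non-alphanumeric character.
    printf '%s\n' "$1" | sed -e 's/^@//' -e 's/[^A-Za-z0-9].*$//'
}

guess_sample '@MS-A1.1;ee=0.03493'   # prints MS
guess_sample '@MS-A2.7;ee=0.01200'   # prints MS as well (illustrative header)
```

Under that assumption, MS-A1, MS-A2 and so on all collapse to the same name, which would explain the single "MS" column.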

Bug, feature or am I not supposed to do it this way? ;)

marchoeppner commented 2 months ago

...and just adding "--sample SAMPLE_ID" did the job...closing:

vsearch --fastq_merge $fwd --reverse $rev \\
    --fastqout $merged \\
    --threads ${task.cpus} \\
    --fastq_eeout \\
    -relabel ${meta.sample_id}. \\
    --sample ${meta.sample_id}
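
As far as I understand the manual, --sample appends a ";sample=..." annotation to each header, and that annotation is then used as the sample name in the OTU table. A quick way to check that the merged reads carry it is sketched below. The helper names and the example header are hypothetical, and the exact position of the annotation within the header may differ.

```shell
#!/bin/sh
# sample_of: pull the value of a ";sample=" annotation out of one header.
# Prints nothing if the header has no such annotation.
sample_of() {
    printf '%s\n' "$1" | sed -n 's/.*;sample=\([^;]*\).*/\1/p'
}

# count_samples: tally reads per sample annotation in a FASTQ file
# (assumes four lines per record, so every 4th line starting at 1 is a header).
count_samples() {
    awk 'NR % 4 == 1' "$1" | sed -n 's/.*;sample=\([^;]*\).*/\1/p' | sort | uniq -c
}

sample_of '@MS-A1.1;sample=MS-A1;ee=0.03493'   # prints MS-A1
```

Running count_samples on the combined fastq before --usearch_global should list every sample ID once, hyphens included.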

torognes commented 2 months ago

I am glad to hear that you found a solution!

frederic-mahe commented 2 months ago

As a follow up: