usearch_global command eats my sample IDs

marchoeppner commented 2 months ago

Hi,

I am recreating a pipeline/workflow I found a little while ago, using Vsearch, which included an interesting "trick" to get the OTU counts per sample via usearch_global for all samples at once. However, I have noticed that usearch_global messes up my sample IDs if these contain a dash/hyphen - maybe because the way I am doing this isn't even considered by the developer.

The basic logic goes as follows:

(trimming and primer site removal are done outside of Vsearch)

For each library:

vsearch --fastq_merge $fwd --reverse $rev \\
    --fastqout $merged \\
    --threads ${task.cpus} \\
    --fastq_eeout \\
    -relabel ${meta.sample_id}.

Note that I am attaching the sample ID to the fastq file via "-relabel"

vsearch -fastq_filter $fq \\
    -fastq_maxee_rate 0.1 \\
    -relabel Filtered \\
    -threads ${task.cpus} \\
    -fastaout $filtered

And then for all samples combined:

-> Dereplicate -> Cluster Unoise -> Uchime3 Denovo -> Cluster size

I then proceed to quantify my samples against the OTU set as follows:

vsearch --usearch_global $fastq \\
    -threads ${task.cpus} \\
    -db $db \\
    -otutabout $tabbed

where $fastq is the combined set of all reads as emitted by the individual -fastq_merge steps.

And this is where it goes wrong, because --usearch_global clips my sample IDs:

Fastq header after "-fastq_merge": @MS-A1.1;ee=0.03493

Which --usearch_global turns into "MS", i.e. deletes everything after the "-"

Since I have many samples with similar names (MS-A1, MS-A2, etc) I am ending up with only a single count column "MS".

#OTU ID MS
OTU_1   188100
OTU_10  31877
OTU_100 24

Bug, feature or am I not supposed to do it this way? ;)

marchoeppner commented 2 months ago

...and just adding "--sample SAMPLE_ID" did the job...closing:

vsearch --fastq_merge $fwd --reverse $rev \\
    --fastqout $merged \\
    --threads ${task.cpus} \\
    --fastq_eeout \\
    -relabel ${meta.sample_id}. \\
    --sample ${meta.sample_id}
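
For anyone who wants to check the combined FASTQ before running --usearch_global, here is a minimal sketch that lists the distinct sample annotations found in the headers. It assumes that --sample adds a ";sample=ID" field to each header and that the file uses plain four-line FASTQ records. The function name list_samples and the file name in the usage comment are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the distinct ";sample=..." values found in the
# header lines of a four-line-per-record FASTQ file.
list_samples() {
    # Header lines are lines 1, 5, 9, ... of the file.
    awk 'NR % 4 == 1' "$1" |
        # Print only the value of the sample annotation, if one is present.
        sed -nE 's/.*;sample=([^;]*).*/\1/p' |
        sort -u
}

# Usage (file name is an assumption):
#   list_samples all_merged.fastq
```

If the list shows one entry per library (MS-A1, MS-A2, and so on), the OTU table should get one column per sample; an empty list would suggest the annotation is missing.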

torognes commented 2 months ago

I am glad to hear that you found a solution!

frederic-mahe commented 2 months ago

As a follow up:

nine tests added (see https://github.com/frederic-mahe/vsearch-tests/commit/b52115b5e053e9ee7ec5fa876eef40ad08043843)
documentation modified to make it easier to discover --sample (see 31b328ab59cc142f4f9cb080d7cf16c410e4f45b)
--otutabout's behavior when --sample is missing remains undocumented (it uses the sequence identifier, but truncates it after most punctuation characters, except _ )
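
To make the fallback easier to picture, here is a rough sketch that imitates it with sed. This is not vsearch's code: it assumes the identifier is cut at the first character that is not a letter, a digit or an underscore, which is only an approximation of "most punctuation characters, except _". The function name guess_sample_id is hypothetical.

```shell
#!/bin/sh
# Hypothetical approximation of the fallback sample name used when no
# sample annotation is present: keep the leading run of letters, digits
# and underscores of the header (assumed rule, not taken from vsearch).
guess_sample_id() {
    printf '%s\n' "$1" |
        # Drop a leading '@' or '>' and keep only the leading [A-Za-z0-9_] run.
        sed -E 's/^[@>]//; s/^([A-Za-z0-9_]*).*$/\1/'
}

guess_sample_id '@MS-A1.1;ee=0.03493'   # prints MS
guess_sample_id '@MS_A1.1;ee=0.03493'   # prints MS_A1
```

Under this assumption, MS-A1 and MS-A2 both collapse to MS, which matches the single count column reported above, while underscore-separated names would stay distinct.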

torognes / vsearch

usearch_global command eats my sample IDs #558