qiyunzhu / woltka

Woltka: a versatile meta'omic data classifier
BSD 3-Clause "New" or "Revised" License

different results between classify directly and collapse after #167

Closed. ProsperP closed this issue 1 year ago.

ProsperP commented 2 years ago

Hi, I was trying to use woltka classify with stratification. However, I found that there is a difference between classifying to a certain level (say 'ko') directly and classifying at the uniref level first, then collapsing to the ko level later. I'm wondering why these two methods give different results. To make my question clearer, here are my data and some code:

C-27.sam.gz


WOLDB=/path/to/wol_database/directory
OUTDIR=./profile_test

# $ ls ${WOLDB}
# function  genomes  proteins  taxonomy  trees
# $ woltka --version
# woltka, version 0.1.4

sample='C-27'
file_path=./C-27.sam.gz
gunzip ${file_path}
file_path=./C-27.sam

## generate map data
mkdir -p ${OUTDIR}/${sample}/mapdir
woltka classify \
    --no-demux \
    --to-tsv \
    --input  ${file_path} \
    --rank   none \
    --outmap ${OUTDIR}/${sample}/mapdir \
    --output ${OUTDIR}/${sample}/ogu.tsv

## classify at uniref level first
mkdir ${OUTDIR}/${sample}/function
woltka classify \
    --no-demux --to-tsv \
    --input    ${file_path} \
    --coords   ${WOLDB}/proteins/coords.txt.xz \
    --map      ${WOLDB}/function/uniref/uniref.map.xz \
    --rank     uniref \
    --stratify ${OUTDIR}/${sample}/mapdir \
    --output   ${OUTDIR}/${sample}/function/ogu_uniref.tsv &

## classify to ko level directly
woltka classify \
    --no-demux --to-tsv \
    --input    ${file_path} \
    --coords   ${WOLDB}/proteins/coords.txt.xz \
    --map      ${WOLDB}/function/uniref/uniref.map.xz \
    --map      ${WOLDB}/function/kegg/ko.map.xz \
    --rank     ko \
    --stratify ${OUTDIR}/${sample}/mapdir \
    --output   ${OUTDIR}/${sample}/function/ogu_ko.tsv &

wait
gzip ${file_path}

## collapse stratified uniref to ko level
woltka tools collapse \
    -f 2 \
    -m ${WOLDB}/function/kegg/ko.map.xz \
    -i ${OUTDIR}/${sample}/function/ogu_uniref.tsv \
    -o ${OUTDIR}/${sample}/function/ogu_ko.from_collapse.tsv

# sort output
sort ${OUTDIR}/${sample}/function/ogu_ko.tsv > ${OUTDIR}/${sample}/function/ogu_ko.sorted.tsv 
sort ${OUTDIR}/${sample}/function/ogu_ko.from_collapse.tsv > ${OUTDIR}/${sample}/function/ogu_ko.from_collapse.sorted.tsv

gawk -F'\t' 'BEGIN{sum = 0}; NR!=1 {sum += $2}; END{print sum}' ${OUTDIR}/${sample}/function/ogu_ko.tsv
# Output: 4954
gawk -F'\t' 'BEGIN{sum = 0}; NR!=1 {sum += $2}; END{print sum}' ${OUTDIR}/${sample}/function/ogu_ko.from_collapse.tsv
# Output: 4829

## run this line manually
#diff ${OUTDIR}/${sample}/function/ogu_ko.sorted.tsv ${OUTDIR}/${sample}/function/ogu_ko.from_collapse.sorted.tsv | less
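## Optionally, a small helper to list features that appear in only one of the
## two results (a sketch assuming both outputs are two-column TSVs with the
## stratified feature ID in column 1; comm requires sorted input, hence the
## extra sort after cut):
#comm -3 <(cut -f1 ${OUTDIR}/${sample}/function/ogu_ko.sorted.tsv | sort) \
#        <(cut -f1 ${OUTDIR}/${sample}/function/ogu_ko.from_collapse.sorted.tsv | sort)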
qiyunzhu commented 2 years ago

Hello @ProsperP Thanks for noting this. I think they differ because direct classify does not handle one-to-many mappings: if one UniRef entry maps to multiple KOs, only the first KO is considered. collapse, in contrast, supports one-to-many mappings, so you will get multiple KOs in the result.

These behaviors are documented here. I realize it isn't easy to find; sorry about that, and I will move it to a more visible place. Meanwhile, please check out some notes about collapse. I would recommend using collapse whenever there are one-to-many mappings and you want to keep them all.
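As a hypothetical illustration (made-up IDs; woltka mapping files are tab-delimited, with one source feature per line followed by one or more targets):

# uniref.map-style entries; the second line is a one-to-many mapping
U000001	K00001
U000002	K00002	K00003

With woltka classify --rank ko, U000002 would contribute only to K00002, whereas woltka tools collapse would count both K00002 and K00003.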

ProsperP commented 2 years ago

@qiyunzhu Thanks for your timely response. Sorry that I didn't make my observation clear.

As I understand it, using collapse without the -d option would inflate the total feature count (henceforth abbreviated as TFC) in case of one-to-many mappings: for example, a UniRef entry with 6 counts mapping to 2 KOs would contribute 6 counts to each KO (12 in total) without -d, or 3 counts to each (6 in total) with -d. However, in my case above, I got a TFC of 4954 through direct classify, but only a TFC of 4829 from collapse without the -d option, and an even lower TFC with the -d option (data not shown).

On the other hand, considering that I might have used a bad example (one where a UniRef entry maps to multiple KOs), I tried to illustrate my point with a one-to-one mapping instead. I ran direct classify at the MetaCyc protein level (function/metacyc/protein.map.xz), and separately collapsed classified ORFs (proteins/coords.txt.xz) to the MetaCyc protein level. Direct classify gave a TFC of 3387, while collapse gave a TFC of 1667. Shouldn't these two methods produce the same results for a one-to-one mapping? Here's my code:

WOLDB=/path/to/wol_database/directory
OUTDIR=./profile_test

# $ ls ${WOLDB}
# function  genomes  proteins  taxonomy  trees
# $ woltka --version
# woltka, version 0.1.4

sample='C-27'
file_path=./C-27.sam.gz
gunzip ${file_path}
file_path=./C-27.sam

## generate map data
mkdir -p ${OUTDIR}/${sample}/mapdir
woltka classify \
    --no-demux \
    --to-tsv \
    --input   ${file_path} \
    --rank    none \
    --outmap  ${OUTDIR}/${sample}/mapdir \
    --output  ${OUTDIR}/${sample}/ogu.tsv

## create the output directory before the backgrounded jobs write into it
mkdir -p ${OUTDIR}/${sample}/function

## classify to MetaCyc protein level directly
woltka classify \
    --no-demux --to-tsv \
    --input    ${file_path} \
    --coords   ${WOLDB}/proteins/coords.txt.xz \
    --map      ${WOLDB}/function/metacyc/protein.map.xz \
    --rank     protein \
    --stratify ${OUTDIR}/${sample}/mapdir \
    --output   ${OUTDIR}/${sample}/function/ogu_protein.tsv &

## classify at ORF level first (rank none)
woltka classify \
    --no-demux --to-tsv \
    --input    ${file_path} \
    --coords   ${WOLDB}/proteins/coords.txt.xz \
    --rank     none \
    --stratify ${OUTDIR}/${sample}/mapdir \
    --output   ${OUTDIR}/${sample}/function/ogu_orf.tsv &
wait
gzip ${file_path}

## collapse the stratified ORF profile to MetaCyc protein level
woltka tools collapse \
    -f 2 \
    -m ${WOLDB}/function/metacyc/protein.map.xz \
    -i ${OUTDIR}/${sample}/function/ogu_orf.tsv \
    -o ${OUTDIR}/${sample}/function/ogu_protein.from_collapse.tsv

# Compute total feature count
gawk -F'\t' 'BEGIN{sum = 0}; NR!=1 {sum += $2}; END{print sum}' ${OUTDIR}/${sample}/function/ogu_protein.tsv
# Output: 3387
gawk -F'\t' 'BEGIN{sum = 0}; NR!=1 {sum += $2}; END{print sum}' ${OUTDIR}/${sample}/function/ogu_protein.from_collapse.tsv
# Output: 1667

# sort output
sort ${OUTDIR}/${sample}/function/ogu_protein.tsv > ${OUTDIR}/${sample}/function/ogu_protein.sorted.tsv
sort ${OUTDIR}/${sample}/function/ogu_protein.from_collapse.tsv > ${OUTDIR}/${sample}/function/ogu_protein.from_collapse.sorted.tsv

## run this line manually to see the feature and feature-count differences between the two results
#diff ${OUTDIR}/${sample}/function/ogu_protein.sorted.tsv ${OUTDIR}/${sample}/function/ogu_protein.from_collapse.sorted.tsv | less


Besides, I looked back into the alignment results and found that the test data set (S01.sam.xz) supplied with woltka itself contains only unique alignments (i.e., each read is aligned to a genome exactly once). When S01 was analyzed using my code above, I got the same TFC (166) from both methods (classify and collapse) at the MetaCyc protein level, as expected. In my own case, however, I performed the alignment step with bowtie2 using the SHOGUN protocol (--very-sensitive --no-head --no-unal --mm -k 16 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min "L,0,-0.05"), which allows multiple alignments per read; the full command is sketched below.
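For reference, my alignment command looked roughly like this (the index path and read file names are placeholders):

# SHOGUN-style bowtie2 alignment; -x/-1/-2/-S values are placeholders
bowtie2 --very-sensitive --no-head --no-unal --mm -k 16 \
    --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min "L,0,-0.05" \
    -x /path/to/WoL/bowtie2_index \
    -1 C-27_R1.fq.gz -2 C-27_R2.fq.gz -S C-27.sam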

So, I think the difference between the two methods might be caused, at least in part, by multiple alignments. But I still couldn't figure out why the TFC resulting from the collapse method was lower than that from direct classify.

Looking forward to your response.

ProsperP commented 2 years ago

Hi @qiyunzhu, I haven't heard back from you since my last update. Do you still consider this difference to be caused by different mappings? Hope to hear from you soon.

qiyunzhu commented 2 years ago

Hello @ProsperP My apologies for the late response; I hope this question is still on your radar. I appreciate your careful testing and comparison of the outputs. I did some research and have an answer for you.

Basically, this discrepancy is caused by decimal rounding under multiple alignments (your observation is correct!). If multiple (say 3) features are mapped to one read, each feature receives 1/3 count. Woltka reports integers by default, so if a feature accumulates only 1/3 count in the entire dataset, it is rounded to 0 and dropped from the result.
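Here is a toy illustration of the rounding effect (plain gawk, not woltka itself): one read hits features A, B, and C, so each receives 1/3 count, and integer formatting drops all three to zero:

# one read mapped to three features; each accumulates 1/3 count
printf 'A\nB\nC\n' | gawk '{frac[$1] += 1/3}
    END {for (f in frac) printf "%s\t%d\t%.3f\n", f, frac[f], frac[f]}'
# A    0    0.333    <- the integer column loses the feature entirely
# B    0    0.333
# C    0    0.333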

This may be acceptable, because such very-low-abundance features can just be false positives. However, it does cause discrepancies in the reported numbers. In your case, for example, when we classify reads directly to MetaCyc proteins with woltka classify, each protein has a much lower chance of accumulating only 1/3 count across the entire dataset and being rounded to zero. But when we go through the per-ORF profile first (woltka classify at rank none, then woltka tools collapse), each individual ORF carries a much smaller count and has a much higher chance of being rounded away before collapsing. That is why you observed fewer features in the latter case.


To solve this problem, you can instruct Woltka to retain a few digits after the decimal point, using --digits. Here are my tests (number of features):

Default behavior (everything is an integer): (screenshot of feature counts)

Adding --digits 3 to the commands: (screenshot of feature counts)

As you can see, the higher precision made the numbers consistent.

For your reference, the full commands are:

woltka classify --no-demux --to-tsv --input C-27.sam.gz --coords coords.txt.xz --rank none --stratify mapdir --output ogu_orf.3.tsv --digits 3
woltka classify --no-demux --to-tsv --input C-27.sam.gz --coords coords.txt.xz --map metacyc/protein.map.xz --rank none --stratify mapdir --output ogu_protein.3.tsv --digits 3
woltka tools collapse -f 2 -m metacyc/protein.map.xz -i ogu_orf.3.tsv -o ogu_protein.from_collapse.3.tsv
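
To verify, you can re-run your gawk sum on the --digits 3 outputs, printing decimals this time; the two totals should now agree:

gawk -F'\t' 'NR!=1 {sum += $2}; END {printf "%.3f\n", sum}' ogu_protein.3.tsv
gawk -F'\t' 'NR!=1 {sum += $2}; END {printf "%.3f\n", sum}' ogu_protein.from_collapse.3.tsv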

You raised a very important question, which led me to rethink the optimal default protocol for functional analysis with Woltka. The integer behavior was inherited from classical microbiomics (OTU counts...). However, where multiple alignments are the norm and precision matters, we may not want to rely on integers. In fact, in gene expression analysis, and in previous microbiome gene abundance analyses (such as HUMAnN), feature abundance is usually calculated as RPK (reads per kilobase), which is a decimal quantity. Woltka supports RPK calculation too, as documented here. In brief, you can do:

woltka classify --no-demux --to-tsv --input C-27.sam.gz --coords coords.txt.xz --map metacyc/protein.map.xz --rank none --stratify mapdir --output ogu_protein.3.tsv --sizes . --scale 1k --digits 3

You will also get 1663 features, which is consistent.

I hope this answer is useful to you. Thank you again for running these tests and pointing out this issue, which greatly helps us improve the protocol.

ProsperP commented 2 years ago

Hi @qiyunzhu, many thanks for your testing, explanation, and suggestions regarding gene expression analysis. This information will help a lot with my subsequent analyses using woltka. It also made me wonder whether it is appropriate to include discordantly mapped reads (i.e., for a read pair, the first read maps to genome A and its mate maps to genome B) in the alignment results before the classification step.

Since a pair of reads can only come from a single genome, a discordantly mapped pair must contain a misalignment. In my view, even when a read pair can map to multiple genomes, it should map concordantly to each of them for an LCA assignment. So I think discordant alignments should be removed to improve classification precision. However, this question, and the others that came to mind after reading your response, are outside the scope of my original question.

Anyway, thanks again for your testing, explanation, and suggestions, and for your great effort on woltka.

qiyunzhu commented 2 years ago

Hi @ProsperP Thank you for your further comments! This is a very thoughtful suggestion. Implementing it shouldn't be hard (although implementing it without hurting performance needs some thought). To my knowledge, many popular metagenome profilers don't take alignment concordance into account. It may be worth running a careful benchmarking study to see whether and how this idea improves (or degrades) the quality of the results. Perhaps such benchmark studies already exist, but I am not aware of any. Are you?

ProsperP commented 2 years ago

Hi @qiyunzhu, glad to see that you're also interested in this topic. I am not aware of any study on it either.

For now, I only have a few thoughts on this.

First, before making a decision we need, if not a comprehensive benchmark, at least some small tests of whether and how alignment concordance affects downstream analysis in metagenomics. As far as I know, many metagenomic profilers that use alignment as a first step don't fully use the pairing information of paired reads. Take MetaPhlAn as an example: although it claims to natively handle paired-end metagenomes (in its Python code), it simply concatenates the paired fastq files into a single fastq file and treats them as single-end data, even though its alignment step uses bowtie2, which supports paired-end alignment. This strategy may have been inherited from the original version, when reference-based metagenomic data were mostly sequenced on a NextSeq 500/550 in SE75 mode.

Nowadays, however, metagenomic data are mostly sequenced in PE150 mode with an insert size (i.e., sequencing library fragment length) of ~350 bp, so the inner distance between the paired reads is about 50 bp, which could in theory be used to improve alignment precision. But perhaps for historical reasons, or out of computational-cost considerations, the pairing information is still not fully used by some metagenomic profilers. Since we know of no prior case of filtering out discordant alignments in metagenomics, and we should not conclude that it improves alignment precision without evidence, the impact of utilizing concordance on metagenomic alignment and classification should be tested.

Also, I think woltka already has some potential to handle discordant alignments. I noticed in one of my tests that woltka produces different mapping output depending on the alignment results: if a read pair maps to multiple genomes in proper pairs, woltka assigns each read one OGU; if the two reads of a pair map to different genomes, woltka assigns multiple OGU entries to each read. Example output produced from only two reads with the SHOGUN protocol: test.sam.gz test_ogu_map.txt

On the other hand, even if we do want to filter out discordant alignments, I don't think this needs to be implemented in the classifier itself, because existing tools can already do it. bowtie2, for example, has a parameter (--no-discordant) to filter out discordant alignments of paired reads. Where an aligner lacks such an option, existing toolkits like samtools view might be capable of performing the filtering, as sketched below.
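For instance (a rough sketch; note that samtools requires a SAM header, so this would not apply directly to the headerless output of the --no-head protocol above):

## keep only properly paired alignments; -f 2 selects the
## "mapped in proper pair" SAM flag, -h retains the header
samtools view -h -f 2 C-27.sam > C-27.proper_pairs.sam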

These are just some quick thoughts. Thanks for taking the time to discuss this with me.

qiyunzhu commented 2 years ago

Hello @ProsperP Thanks for sharing your further thoughts, which are very insightful. The point about the evolution of sequencing technology and read formats makes good sense to me. Relatedly, people also hesitate over whether or not to merge forward/reverse reads. There are many things to be done to perfect the analysis! I envision that utilizing the pairing information instead of merging the reads (given the gap between them), as you suggested, may be the best solution.

As you noted, Woltka has some design choices that unintentionally reduce the impact of discordant alignments. I also agree that the bowtie2 parameter --no-discordant may solve the problem. One thing I am not sure about is that bowtie2 is a heuristic aligner: it stops at the first reference sequence where it finds a good-enough match. If the database is large and contains many closely related reference sequences, there is a chance that the forward and reverse reads map to different sequences and the pair gets discarded under --no-discordant. I am thinking that the ideal solution may be to let Woltka examine how far along the taxonomic / phylogenetic hierarchy the forward and reverse reads converge, and classify the read pair according to that metric. This might make a useful study. My email is qiyunzhu@gmail.com in case you would like to discuss more offline.