harish0201 / Analyses_Pipelines

Analyses pipelines that I use for my work
12 stars 7 forks source link

No filtering was done! Exiting... #2

Open mictadlo opened 3 years ago

mictadlo commented 3 years ago

Hi, Thank you for sharing your cafe pipeline. However, I ran into the following problems:

> awk -F'\t' '{print "(null)\t"$0}' Orthogroups.GeneCount.tsv > tmp.tsv
> sed -i.bak 's|(null)|Desc|g' tmp.tsv

> head tmp.tsv
Desc    Orthogroup  Atha    Cann    Csat    Mgut    Natt    Nlab    Nobt    Noto    Nqld    Nsyl    Ntab    Ntom    Ptri    Slyc    Smel    Stub    Vvin    Total
Desc    OG0000000   0   0   0   0   0   3   0   0   982 0   0   1   0   0   0   986
Desc    OG0000001   0   9   0   0   5   23  8   59  3   31  267 14  0   6   113 540

> python2.7 ~/scripts/cafe/cafetutorial_clade_and_size_filter.py -i tmp.tsv -s -o cafe.input.tsv
No filtering was done! Exiting...

> conda activate cafe
> cafe
# load -i tmp.tsv reports/log_run1.txt
Failed to load file

What did I miss?

Thank you in advance

harish0201 commented 3 years ago

Ah! It's been some time that I've looked at this, so apologies for the confusion ;)

In the tmp.tsv, as per CAFE's format, the first column should be the (null) with the header being Desc. So what you've shared should look like:

head tmp.tsv
Desc    Orthogroup  Atha    Cann    Csat    Mgut    Natt    Nlab    Nobt    Noto    Nqld    Nsyl    Ntab    Ntom    Ptri    Slyc    Smel    Stub    Vvin    Total
(null)  OG0000000   0   0   0   0   0   3   0   0   982 0   0   1   0   0   0   986
(null)  OG0000001   0   9   0   0   5   23  8   59  3   31  267 14  0   6   113 540

IIRC, this is what the awk portion actually does, only the "first column header" should be the "Desc".

Once this is done, you can run the cafetutorial_clade_and_size_filter.py script that CAFE has.

In any case, what it does is, it's going to remove the OGs/Groups if any of the species have more than 100 proteins in them.

Once this is done, you'd need to create a rooted tree. I recommend using Timetree.org or other age approaches (ape package in R is a good started)

Unfortunately, the way I've arranged the CAFE thing is essentially a guideline as to how I do it, rather than a complete pipeline because I kept running into issues :(

I might automate it some day, hopefully, but the use case for me is fairly too low to justify the time sunk into it.

mictadlo commented 3 years ago

Thank you. The new file looks like this:

> head tmp.tsv
Desc    Orthogroup  Atha    Cann    Csat    Mgut    Natt    Nlab    Nobt    Noto    Nqld    Nsyl    Ntab    Ntom    Ptri    Slyc    Smel    Stub    Vvin    Total
(null)  OG0000000   0   0   0   0   0   3   0   0   982 0   0   1   0   0   0   0   0   986
(null)  OG0000001   0   9   0   0   5   23  8   59  3   31  267 14  0   6   113 2   0   540
(null)  OG0000002   0   128 3   20  21  21  8   42  21  36  70  37  3   25  38  39  22  534
(null)  OG0000003   1   66  12  20  29  17  12  42  21  30  60  21  5   19  36  84  42  517

The age (28 MYA) has been provided to me by a collaborator. Unfortunately, I got a new error:

> make_ultrametric.py -r 28 SpeciesTree_rooted.txt
> head SpeciesTree_rooted.txt.ultrametric.tre
(((Ptri:21.5258,(Csat:19.4873,Atha:19.4873):2.03847):2.54693,Vvin:24.0727):3.92728,(Mgut:23.5518,((Cann:4.94278,((Stub:1.91372,Slyc:1.91372):1.99847,Smel:3.91219):1.0306):2.54854,((Natt:2.59231,((Nsyl:1.44399,Ntab:1.44399):0.826096,(Nlab:0.508663,Nqld:0.508663):1.76142):0.322218):0.530061,((Ntom:1.26249,Noto:1.26249):1.11482,Nobt:2.37732):0.745047):4.36896):16.0605):4.44819);

python2.7 ~/scripts/cafe/cafetutorial_clade_and_size_filter.py -i tmp.tsv -s -o cafe.input.tsv
> cafe
# load -i cafe.input.tsv
-----------------------------------------------------------
Family information: cafe.input.tsv
Log: stdout
The number of families is 35104
Root Family size : 1 ~ 124
Family size : 0 ~ 149
P-value: 0.01
Num of Threads: 1
Num of Random: 1000
# tree (((Ptri:21.5258,(Csat:19.4873,Atha:19.4873):2.03847):2.54693,Vvin:24.0727):3.92728,(Mgut:23.5518,((Cann:4.94278,((Stub:1.91372,Slyc:1.91372):1.99847,Smel:3.91219):1.0306):2.54854,((Natt:2.59231,((Nsyl:1.44399,Ntab:1.44399):0.826096,(Nlab:0.508663,Nqld:0.508663):1.76142):0.322218):0.530061,((Ntom:1.26249,Noto:1.26249):1.11482,Nobt:2.37732):0.745047):4.36896):16.0605):4.44819);
(((Ptri:21.5258,(Csat:19.4873,Atha:19.4873):2.03847):2.54693,Vvin:24.0727):3.92728,(Mgut:23.5518,((Cann:4.94278,((Stub:1.91372,Slyc:1.91372):1.99847,Smel:3.91219):1.0306):2.54854,((Natt:2.59231,((Nsyl:1.44399,Ntab:1.44399):0.826096,(Nlab:0.508663,Nqld:0.508663):1.76142):0.322218):0.530061,((Ntom:1.26249,Noto:1.26249):1.11482,Nobt:2.37732):0.745047):4.36896):16.0605):4.44819)
No species 'total' was found in the tree

What did I miss?

Thank you in advance,

Michal

harish0201 commented 3 years ago

Hey!

You have to remove the "Total" column, that won't be needed. The error specifies that :)

mictadlo commented 3 years ago

Thank you,, I have forgotten to do awk -F'\t' '{$NF=""; print $0}' tmp.tsv | rev | sed 's/^\s*//g' | rev | tr ' ' '\t' > mod.tsv. Unfortunately, I got now Tree must be binary:

> awk -F'\t' '{$NF=""; print $0}' tmp.tsv | rev | sed 's/^\s*//g' | rev | tr ' ' '\t' > mod.tsv
> python2.7 ~/scripts/cafe/cafetutorial_clade_and_size_filter.py -i mod.tsv -s -o cafe.input.tsv  
>  cafe
# load -i cafe.input.tsv
-----------------------------------------------------------
Family information: cafe.input.tsv
Log: stdout
The number of families is 27493
Root Family size : 1 ~ 124
Family size : 0 ~ 149
P-value: 0.01
Num of Threads: 1
Num of Random: 1000
# tree (((Ptri:21.5258,(Csat:19.4873,Atha:19.4873):2.03847):2.54693,Vvin:24.0727):3.92728,(Mgut:23.5518,((Cann:4.94278,((Stub:1.91372,Slyc:1.91372):1.99847,Smel:3.91219):1.0306):2.54854,((Natt:2.59231,((Nsyl:1.44399,Ntab:1.44399):0.826096,(Nlab:0.508663,Nqld:0.508663):1.76142):0.322218):0.530061,((Ntom:1.26249,Noto:1.26249):1.11482,Nobt:2.37732):0.745047):4.36896):16.0605):4.44819);
(((Ptri:21.5258,(Csat:19.4873,Atha:19.4873):2.03847):2.54693,Vvin:24.0727):3.92728,(Mgut:23.5518,((Cann:4.94278,((Stub:1.91372,Slyc:1.91372):1.99847,Smel:3.91219):1.0306):2.54854,((Natt:2.59231,((Nsyl:1.44399,Ntab:1.44399):0.826096,(Nlab:0.508663,Nqld:0.508663):1.76142):0.322218):0.530061,((Ntom:1.26249,Noto:1.26249):1.11482,Nobt:2.37732):0.745047):4.36896):16.0605):4.44819)
# lambda -s -t ((1,1)1,(1,1):1,1)
Tree must be binary

What did I miss?

Thank you in advance,

Michal

harish0201 commented 3 years ago

The lambda tree that you've pasted should look like the tree that you've generated (Ptri.....)

or rather taking a small portion of your ultrametric tree: (Ptri:21.5258,(Csat:19.4873,Atha:19.4873):2.03847):2.54693,Vvin:24.0727)

the lambda tree would look like: (1,(2,2)),1) assuming Csat is your species of interest. My recommendation is go through the CAFE tutorial once though, because it covers this in more detail.