mgalardini / pyseer

SEER, reimplemented in python 🐍🔮
http://pyseer.readthedocs.io
Apache License 2.0
104 stars 25 forks source link

Empty kmer_patterns.txt file with --wg enet #203

Closed charlottecc closed 2 years ago

charlottecc commented 2 years ago

Hi all,

I’m new to GWAS and have run fsm-lite K-mer output through pyseer with mixed effect, fixed effect and elastic net models, correcting for population structure.

Here is my elastic net command: pyseer --phenotypes case_trait.txt \ --kmers fsm-kmers.txt.gz --wg enet --cpu 40 \ --lineage-clusters baps_lineages.txt \ --output-patterns kmer_patterns_eneta001_bapslin.txt --print-samples --save-model case.lasso --alpha 0.01 > kmers_enet_a001-bapslineages.txt

With fixed effect and mixed effect models, I get hits in the txt file output and in kmer_patterns.txt output. However when I run the elastic net model I get an empty kmer_patterns file so cannot then go onto the post processing scripts.

Is there a reason for this? Is the elastic net model appropriate to use with kmers and cogs?

I’d appreciate any insight!

Thanks in advance

mgalardini commented 2 years ago

Hi, what was the output from the command line? What could have happened is that no variants were actually tested: the number of read/filtered and tested variants should be given in the stderr.

charlottecc commented 2 years ago

Thanks for getting back to me!

This is the output file produced : kmers_enet_a001-bapslineages.txt

This is the print out after the command was run: Fitting elastic net to top 8220278 variants [status] Parallel glmnet cv with 40 cores Best penalty (lambda) from cross-validation: 4.96E+00 Best model deviance from cross-validation: 0.821 ± 8.23E-02 Best R^2 from cross-validation: -0.165 Predictions within each lineage Lineage Size R2 TP TN FP FN 1 360 -0.094 329 0 31 0 2 133 -0.304 102 0 31 0 3 142 -0.257 113 0 29 0 4 32 -0.333 24 0 8 0 5 21 -0.050 20 0 1 0 6 77 -0.132 68 0 9 0 7 4 nan 4 0 0 0 Finding and printing selected variants Saved enet model as case.lasso.pkl 19236117 loaded variants 11015839 pre-filtered variants 8220278 tested variants 1095 printed variants

johnlees commented 2 years ago

Output patterns aren't output with the elastic net, sorry that's not at all clear from the docs, and we should probably add a warning/check for this when these options are used together. (Note on code: the file is opened, but the lines that write patterns are only in fixed and LMM)

I am not sure of a use of the number of patterns with the elastic net – could you describe the intended use? If you want to adjust the p-values, these are from the fixed model, so you could just use the number of patterns from that.

(and elastic net is fine with k-mers and COGs just the same. You might want to consider using unitigs rather than k-mers as it makes a few things easier though)

charlottecc commented 2 years ago

Thank you for getting back to me and clarifying about the K-mer patterns with elastic net.

I was intending to use the K-mer patterns with the count_patterns.py to determine a significance threshold, the same as described for the fixed and mixed effect models. Can these be taken from the fixed effect and mixed effect models and compared with the outputs from elastic net?

I have run pyseer with fsm-lite and unitig outputs. I did not get any tested variants with the output from unitig-counter. command and pyseer output:

unitig-counter -strains strain_list.txt --pyseer -output unitig_output -nb-cores 24

pyseer --lmm --phenotypes case_trait.txt --kmers ~/gwas_GEMS_VIDA/kmer_gemssf6/unitig_output/unitigs.txt --uncompressed \ --similarity phylogeny_similarity.tsv \ --output-patterns kmer_patterns_lmmuni.txt --cpu 24 > case_kmers-unilmm.txt

394360 loaded variants 394360 pre-filtered variants 0 tested variants 0 printed variants

johnlees commented 2 years ago

I was intending to use the K-mer patterns with the count_patterns.py to determine a significance threshold, the same as described for the fixed and mixed effect models. Can these be taken from the fixed effect and mixed effect models and compared with the outputs from elastic net?

Yes in principle, as the p-values from the elastic net are exactly the same as the fixed effect p-values. But normally the elastic net we'd just use those selected, rather than a threshold. The p-values can just make those selected nicer to plot.

I have run pyseer with fsm-lite and unitig outputs. I did not get any tested variants with the output from unitig-counter. command and pyseer output:

unitig-counter -strains strain_list.txt --pyseer -output unitig_output -nb-cores 24

pyseer --lmm --phenotypes case_trait.txt --kmers ~/gwas_GEMS_VIDA/kmer_gemssf6/unitig_output/unitigs.txt --uncompressed \ --similarity phylogeny_similarity.tsv \ --output-patterns kmer_patterns_lmmuni.txt --cpu 24 > case_kmers-unilmm.txt

394360 loaded variants 394360 pre-filtered variants 0 tested variants 0 printed variants

Looks like something's gone wrong here as everything is being filtered out. Do you get any other messages? Can you confirm that the sample names match between the phentopyes, the similarity tsv file, and the unitigs file?

mgalardini commented 2 years ago

Output patterns aren't output with the elastic net, sorry that's not at all clear from the docs, and we should probably add a warning/check for this when these options are used together.

Ah yes good point, I'd be happy to address that as soon as I can

charlottecc commented 2 years ago

Yes in principle, as the p-values from the elastic net are exactly the same as the fixed effect p-values. But normally the elastic net we'd just use those selected, rather than a threshold. The p-values can just make those selected nicer to plot.

Thanks for explaining this. I'll just use the selected hits from elastic net.

Looks like something's gone wrong here as everything is being filtered out. Do you get any other messages? Can you confirm that the sample names match between the phentopyes, the similarity tsv file, and the unitigs file?

I checked the names and I did have a discrepancy. I'd included .fasta in the ID column when running unitig-counter. I now have tested variants!! Thank you for spotting that! I'll run the other --kmer analyses with Unitigs now if that is recommended.

394360 loaded variants 161340 pre-filtered variants 233020 tested variants 233020 printed variants

Could I also just check if I'm right in thinking you shouldn't use --distances and --lineage-clusters in the same command when running -wg enet? Without --distances I don't get an lrt-pvalue

johnlees commented 2 years ago

The distances and lineage clusters are for LMM or fixed effects models yes, but, if you want p-values from the enet model notes these are actually just fixed effect p-values, hence the requirement for distances.

Hope that's all sorted, feel free to reopen if you have another issue