hillerlab / TOGA

TOGA (Tool to infer Orthologs from Genome Alignments) implements a novel paradigm to infer orthologous genes. TOGA integrates gene annotation, inferring orthologs, and classifying genes as intact or lost.
MIT License

Questions about custom training of XGBoost models #149

Closed · fuesseler closed 6 months ago

fuesseler commented 6 months ago

Dear authors,

As I am working with squamate genomes I wanted to test retraining the TOGA models for this group, and I have some questions regarding this. Any help is much appreciated!

1) I was following your tutorial, using the "create_train.ipynb" and "train_model.py" scripts to train SE and ME models. I had to make several minor edits to the IPython notebook to produce a training dataset that is accepted by train_model.py, because running the notebook as-is threw errors about some required features missing. While I am not particularly skilled with Python and my changes might not be the most elegant, I wanted to ask whether you would like to take a look at this edited version of create_train.ipynb? Other people attempting the training might find it useful not to have to troubleshoot this again, and I'd be happy to share it.

2) I was wondering about the stats of my custom-trained XGBoost models. The accuracy seems quite a bit lower (and the training dataset is also smaller) than what you report in your paper. Of course, as far as I saw, the "artificially rearranged" gene-chain pairs that you used in the publication are not included in the IPython notebook, so that might explain it? Or do you think I should start from two more closely related Ensembl annotations (I could use the green anole plus a snake genome instead)?

This is the log I was getting:

    Training dataset size: 10404
    Single exon train length: 810; multi exon: 9594
    Single exon model:
    Training on 810 samples
    Positives: 255; Negatives: 555
    Using features: Index(['gl_exo', 'flank_cov', 'exon_perc', 'synt_log'], dtype='object')
    Accuracy: 95.309 0.741
    Model saved to: /work/uesseler/software/TOGA/models/se_model_AnoCar_SalMer.dat
    Multi exon model:
    Training on 9594 samples
    Positives: 4609; Negatives: 4985
    Using features: Index(['gl_exo', 'loc_exo', 'flank_cov', 'synt_log', 'intr_perc'], dtype='object')
    Accuracy: 96.102 0.449
    Model saved to: /work/uesseler/software/TOGA/models/me_model_AnoCar_SalMer.dat
    Done in 0:03:16.825670

3) I wanted to try out the new model I trained, but could not find any details on how to tell TOGA to use the custom models instead of the standard ones. Can you please point me in the right direction?

Best regards, Flora

MichaelHiller commented 6 months ago

Hi Flora, regarding re-training @kirilenkobm is the expert. Hopefully he can respond to your questions.

However, if you are stuck with this, I would suggest running TOGA with the standard trained model (human-mouse), which we know does a great job for turtles, fish, birds and therefore likely reptiles as well.

fuesseler commented 6 months ago

Thanks! I have already been using the standard trained model - I was just curious whether retraining would be worth it to improve results further, so I thought I'd at least give it a try. For now I'll continue with the standard model, but I'm still happy to get some answers once @kirilenkobm finds the time.

francicco commented 6 months ago

Hi,

If I may jump in: I'm working on a dataset of springtails - a much deeper phylogenetic framework than mammals, for example. I'd also be interested in training TOGA for this group, but I have no idea where to start. Cheers, F

kirilenkobm commented 6 months ago

Hi all!


Thanks for bringing up this point. The general idea was that the classifier module is customizable. However, in practice, it was written a few years ago and hasn't been revisited since.

@fuesseler, answering your questions:

1) I'm sorry... this notebook is really outdated. I'd be happy to look at the polished version. train_model.py is used to create the models and must output the quality metrics, which should match the reported values.

2) I should mention that the augmented part of the dataset contributed significantly. We also invested a lot of time in distilling the training data. However, there's a chance that the model is slightly overfitted towards mammalian species. For instance, among mammals synteny isn't a crucial feature, but its importance may increase when we annotate, just as an example, a platypus based on human data.

3) Shame on me, but the model paths are hardcoded in toga.Toga.__classify_chains like this:

        self.se_model = os.path.join(self.LOCATION, "models", "se_model.dat")
        self.me_model = os.path.join(self.LOCATION, "models", "me_model.dat")
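Until this is made configurable, the simplest workaround is to either overwrite models/se_model.dat and models/me_model.dat with your custom files, or patch those two lines to point at them, for example (using the filenames from your training log):

        self.se_model = os.path.join(self.LOCATION, "models", "se_model_AnoCar_SalMer.dat")
        self.me_model = os.path.join(self.LOCATION, "models", "me_model_AnoCar_SalMer.dat")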

@francicco, first, please take a look at the train_model.py script. In fact, TOGA extracts more features than it needs, and it uses two models - one for single-exon and another for multi-exon genes - with each model having a slightly different set of features. The entire dataset is in models/train.tsv, which is then split in two, one part per model. See constants.py:

    # lists of features required by single and multi exon models
    SE_MODEL_FEATURES = ["gl_exo", "flank_cov", "exon_perc", "synt_log"]
    ME_MODEL_FEATURES = ["gl_exo", "loc_exo", "flank_cov", "synt_log", "intr_perc"]

The script creates two model files, the same ones referenced by the hardcoded paths above:

        self.se_model = os.path.join(self.LOCATION, "models", "se_model.dat")
        self.me_model = os.path.join(self.LOCATION, "models", "me_model.dat")
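For orientation, here is a minimal sketch of what the training boils down to (assumptions: a "single_exon" flag and a "y" label column in train.tsv, default XGBoost hyperparameters, and joblib for writing the .dat files - check train_model.py for the real column names and settings):

    import pandas as pd
    import joblib
    from xgboost import XGBClassifier

    # feature lists from constants.py
    SE_MODEL_FEATURES = ["gl_exo", "flank_cov", "exon_perc", "synt_log"]
    ME_MODEL_FEATURES = ["gl_exo", "loc_exo", "flank_cov", "synt_log", "intr_perc"]

    df = pd.read_csv("models/train.tsv", sep="\t")

    # split the dataset in two, one part per model ("single_exon" is an assumed column)
    se_df = df[df["single_exon"] == 1]
    me_df = df[df["single_exon"] == 0]

    se_model = XGBClassifier().fit(se_df[SE_MODEL_FEATURES], se_df["y"])
    me_model = XGBClassifier().fit(me_df[ME_MODEL_FEATURES], me_df["y"])

    # serialize to the .dat files TOGA expects (joblib is an assumption here)
    joblib.dump(se_model, "models/se_model.dat")
    joblib.dump(me_model, "models/me_model.dat")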

Then they are used in Toga.__classify_chains, which triggers the modules/classify_chains.py script - it takes the paths to the two models plus the data extracted from chains, and essentially applies the models to that data.

The input data for classification is stored in self.chain_results_df, in the output directory:

        self.chain_results_df = os.path.join(self.temp_wd, "chain_results_df.tsv")

The crucial part is the output of this script, which is saved to two files:

        self.transcript_to_chain_classes = os.path.join(self.temp_wd, "trans_to_chain_classes.tsv")
        self.pred_scores = os.path.join(self.wd, "orthology_scores.tsv")

The prediction scores are just a table with three columns: transcript (called "gene" in the code, but it's actually a transcript), chain, and score - a float from 0 to 1, where 1 indicates an ortholog and 0 a paralog. This is just the raw model output.
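So a few rows of orthology_scores.tsv could look like this (the transcript/chain IDs and scores here are invented, purely for illustration):

    gene    chain    score
    ENST00000367770    102    0.981
    ENST00000367770    4519    0.034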

(!) However, TOGA distinguishes four classes of chains, and the ML models are used only to separate orthologs from paralogs! TOGA also separates out spanning chains (their separation is quite trivial and does not require ML) and processed pseudogene chains. So the second output file, "trans_to_chain_classes.tsv", contains information for these four classes. This is how the second file is created: https://github.com/hillerlab/TOGA/blob/97eb5a17ce76fccd9858b2ed738c51cd661292aa/modules/classify_chains.py#L253 Basically, it contains a header: f.write(f"GENE\t{ORTH}\t{PARA}\t{SPAN}\t{P_PGENES}\n") (again, it should say "transcript" instead of "gene"). Ah yes, one important detail: in the raw output file, spanning chains have a "-1" score and processed pseudogenes a "-2", so they are not confused with truly classified transcript/chain pairs.

Then, each row contains tab-separated fields. The first one is the transcript ID. The second is a comma-separated list of orthologous chain IDs, the third is the same for paralogs, and so on.

If the group is empty, it is just 0.
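For illustration, one such row might look like this (IDs invented: two orthologous chains, one paralog, and zeros for the empty spanning and processed pseudogene groups):

    ENST00000367770    102,211    4519    0    0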

To customize this part of TOGA deeply, I'd recommend patching and tuning the __classify_chains function. self.chain_results_df contains the features extracted from chains, and the function must fill two files: self.transcript_to_chain_classes and self.pred_scores.

You could select a set of features that best fit your goal - using and combining already existing features would be the simplest option. Create a training dataset with columns like transcript + chain + X features + y target. Then train a model (or models), write a script that applies it to the dataset stored in self.chain_results_df, and produce the two output files. This is a very general, high-level overview; I'd be happy to answer more detailed questions :). A sketch of such a script follows below.
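To make this concrete, here is a minimal sketch of such a replacement classification step (the paths, the "gene"/"chain" column names, the 0.5 threshold, and using only the multi-exon model are all simplifying assumptions; it also skips the spanning/processed-pseudogene handling described above):

    import pandas as pd
    import joblib

    # feature set of the multi-exon model (from constants.py)
    ME_MODEL_FEATURES = ["gl_exo", "loc_exo", "flank_cov", "synt_log", "intr_perc"]

    # features extracted from chains; the path under the TOGA run directory is an assumption
    df = pd.read_csv("toga_out/temp/chain_results_df.tsv", sep="\t")
    model = joblib.load("models/me_model.dat")

    # raw scores: probability that each transcript/chain pair is orthologous
    df["score"] = model.predict_proba(df[ME_MODEL_FEATURES])[:, 1]
    df[["gene", "chain", "score"]].to_csv("toga_out/orthology_scores.tsv", sep="\t", index=False)

    # naive class assignment: score > 0.5 -> ortholog, otherwise paralog
    # (the real header is built from constants; empty groups are written as 0)
    with open("toga_out/trans_to_chain_classes.tsv", "w") as f:
        f.write("GENE\tORTH\tPARA\tSPAN\tP_PGENES\n")
        for gene, sub in df.groupby("gene"):
            orth = ",".join(map(str, sub.loc[sub["score"] > 0.5, "chain"])) or "0"
            para = ",".join(map(str, sub.loc[sub["score"] <= 0.5, "chain"])) or "0"
            f.write(f"{gene}\t{orth}\t{para}\t0\t0\n")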