More example of yaml please

FsheriF commented 8 months ago

Dear immuneML team:

Thanks for your amazing immuneML tool. I am using the local version of immuneML for repertoire classification. In the official tutorial (https://docs.immuneml.uio.no/latest/installation.html) I found only a few complete example YAML files, using k-mer encoding or SequenceAbundance. There are detailed introductions to the various encodings and ML methods, but how to correctly combine different 'encodings' with compatible 'ml_methods' and 'instructions' into a complete working YAML file is still confusing to me, especially when using Word2Vec/DeepRC/TCRdist encoding for repertoire classification.

If you could kindly provide as many YAML files as possible that combine different encodings and ML methods for repertoire classification, I would very much appreciate your help.

Best regards,

LonnekeScheffer commented 8 months ago

Dear Zichang Xu,

Thank you for your interest! There are indeed a lot of possible combinations between encodings and ML methods for different dataset types. We have included a figure in the documentation which draws 'paths' between the different legal combinations of dataset types, encodings, ML methods, and reports. The figure can be found on the YAML specification documentation page: https://docs.immuneml.uio.no/latest/specification.html#.

In the figure you'll see that some ML methods are limited to a specific dataset type, and thus not every encoder/ML method can be used for repertoire datasets. The TCRdist encoder is limited to receptor datasets, so you will not be able to use it on your repertoire dataset.
The Word2Vec encoder (as well as the 3 other encoders in the same block: OneHot encoder, KmerFrequency encoder, EvennessProfile encoder) can be used in combination with this block of ML methods: LogisticRegression, RandomForestClassifier, SVM, SVC, KNN. All of these components can be used for repertoire datasets.
The DeepRC encoder can be used with the DeepRC ML method. This method is a bit advanced to use, and may require some effort to get to work. Note that you must have access to a GPU to be able to run the method (a standard CPU will not work) and have the optional DeepRC dependencies installed.

The linked documentation pages contain a small YAML example for each individual encoding and ML method. The only instruction for training ML models is the TrainMLModel instruction, and this tutorial may be of use to you: https://docs.immuneml.uio.no/latest/tutorials/how_to_train_and_assess_a_receptor_or_repertoire_classifier.html. The total number of legal YAML specifications is too large to give individual examples, but you can simply take the YAML example given in this tutorial and replace the YAML bits for individual encoders and ML methods with the bits from the linked documentation pages.
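As a rough sketch of that substitution (not copied from the documentation), a specification that pairs the Word2Vec and KmerFrequency encoders with LogisticRegression and RandomForestClassifier on a repertoire dataset might look roughly like the following. The dataset name, paths, label name, and all parameter values are placeholders or assumptions, and the exact set of required keys may differ between immuneML versions, so please check each block against the linked documentation pages before running it.

```yaml
definitions:
  datasets:
    my_repertoire_dataset:                  # hypothetical name
      format: AIRR
      params:
        path: path/to/repertoires/          # placeholder path
        metadata_file: path/to/metadata.csv # placeholder path
        is_repertoire: True
  encodings:
    my_w2v:                                 # hypothetical name
      Word2Vec:
        vector_size: 16                     # assumed value
        k: 3                                # assumed value
        model_type: SEQUENCE
    my_kmer_freq:                           # hypothetical name
      KmerFrequency:
        k: 3                                # assumed value
  ml_methods:
    my_log_reg: LogisticRegression
    my_random_forest: RandomForestClassifier
instructions:
  my_training_instruction:                  # hypothetical name
    type: TrainMLModel
    dataset: my_repertoire_dataset
    labels:
      - disease                             # assumed label; must be a column in the metadata file
    settings:                               # each entry is one encoding/ML method combination to compare
      - encoding: my_w2v
        ml_method: my_log_reg
      - encoding: my_kmer_freq
        ml_method: my_random_forest
    assessment:
      split_strategy: random
      split_count: 1
      training_percentage: 0.7
    selection:
      split_strategy: random
      split_count: 1
      training_percentage: 0.7
    strategy: GridSearch
    metrics: [accuracy]
    optimization_metric: balanced_accuracy
    number_of_processes: 4
    refit_optimal_model: False
```

To try another legal combination from the figure, only the entries under encodings, ml_methods, and settings should need to change; the rest of the instruction can stay as it is.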

I hope this explanation was helpful to you. We will continue to improve our documentation in the future, and please do let us know if anything specific is missing such that it can be added to future releases of the platform. As this issue does not concern a bug, I will close it for now. But please feel free to reach out if more help is needed. You can in this case send an email to: contact@immuneml.uio.no

FsheriF commented 8 months ago

Dear Lonneke Scheffer,

Thanks for your detailed explanation. I can now figure out the possible combinations from the figure. BTW, may I ask whether the input 'sequence_aas' sequences can also be set to, for example, the full-length variable region or CDR1+2+3, and not only CDR3?

Thanks in advance.

LonnekeScheffer commented 8 months ago

Dear Zichang Xu,

It depends on what you mean, as the parameter 'sequence_aas' is used in different places. Typically, immuneML only uses the CDR3 sequence. That is why, during import, the name 'sequence_aas' is by default matched to the column name of the CDR3 sequence. If you want to use full sequence information, this can be specified during import by replacing the column_mapping (see for example the AIRR import documentation). It should be noted, however, that a lot of functionalities in immuneML presume the sequences are just CDR3 sequences, so some functionalities may not perform as well if full sequences or CDR1+2+3 are supplied this way. Whether that's a good idea depends on what you'll use it for later. If you're matching full concatenated CDR1+2+3 sequences I suppose it can be OK, but if you're using something like k-mer encoding, you can get strange results (with k-mers spanning several CDRs and so on). It's a little bit 'hacky', but I'm happy to provide advice if more specific questions come up.
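A minimal sketch of such an import, assuming an AIRR-format dataset whose files contain a full-length amino acid column named sequence_aa, could look like this. The dataset name, the paths, and the presence of that column are assumptions for illustration. Other import parameters (region_type, for example) may also affect how the column is processed, so please check the import documentation for your immuneML version.

```yaml
definitions:
  datasets:
    my_full_sequence_dataset:               # hypothetical name
      format: AIRR
      params:
        path: path/to/repertoires/          # placeholder path
        metadata_file: path/to/metadata.csv # placeholder path
        is_repertoire: True
        column_mapping:
          # format is <column name in file>: <immuneML field name>
          # assumed full-length column, used here instead of the default CDR3 column
          sequence_aa: sequence_aas
          # depending on the version, other default mappings may need to be repeated here
```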

Alternatively, some encoders and ML methods support the use of V and J gene information. This will, however, depend on the individual encoder or method; there is no one-size-fits-all explanation. Here is one example: the Emerson method (detailed example tutorial here) uses the SequenceAbundance encoder. In this encoder, you can choose to use only the sequence (CDR3) information, but you can also require the V and J genes to match, like so:

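    # Match on the CDR3 amino acid sequence plus V and J gene names
    sequenceabundance_with_vj_genes:
      SequenceAbundance:
        comparison_attributes:
          - sequence_aas
          - v_genes
          - j_genes

    # Match on the CDR3 amino acid sequence only
    sequenceabundance_only_cdr3:
      SequenceAbundance:
        comparison_attributes:
          - sequence_aas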

In the first example, two sequences are only a 'match' if the CDR3 is identical and the V and J genes have the same name. In the second example, only the CDR3 sequence must match, and the V and J genes may differ. Note that this does not use the sequence of the V and J genes, only their names (e.g., V1-1).

I hope this helps!

Lonneke

uio-bmi / immuneML

More example of yaml please #168