uio-bmi / immuneML

immuneML is a platform for machine learning analysis of adaptive immune receptor repertoire data.
https://immuneml.uio.no
GNU Affero General Public License v3.0

Galaxy interface trims CDR3 residues in create dataset #167

Closed: agirgis3 closed this issue 9 months ago

agirgis3 commented 9 months ago

Hello, when generating immuneML datasets through the Galaxy interface, I am unable to specify trim_leading_trailing: false in the 'simple' parameter mode. I would like more control by supplying my own .yaml file in the Galaxy Create Dataset interface.

However, I am also unable to upload my own .yaml specification for this purpose, because the data files appear to be loaded from a temporary cache whose file path I do not know.

Please let me know if this query makes sense; if not, I am happy to elaborate.

LonnekeScheffer commented 9 months ago

Hi Alexander, thanks for reaching out! What type of dataset are you working with? Sequence/receptor or repertoire dataset?

agirgis3 commented 9 months ago

Hi Lonneke, I am working with sequence datasets of unpaired TRB chains, provided in AIRR format.

LonnekeScheffer commented 9 months ago

I have looked a bit deeper into this. We have documentation for this tool here: https://docs.immuneml.uio.no/latest/galaxy/galaxy_dataset.html#using-the-advanced-create-dataset-interface

I went back and forth a bit with the tool myself and was able to get it to work with an example dataset. A "temporary cache path" message is printed at the beginning of the stdout, but this is not an error, so I believe the real issue is something else. When your Galaxy tool crashes, it's possible to send a bug report from Galaxy so we can investigate the history and error directly (described in the docs here: https://docs.immuneml.uio.no/latest/galaxy/galaxy_intro.html#viewing-errors-and-reporting-bugs-in-galaxy).

If you're editing an existing yaml file (such as the one outputted by the create dataset tool), please make sure to:

When working with repertoire datasets, make sure to select the metadata file as one of the input files (in addition to the repertoire files). If you have a lot of repertoires and they're in a collection, the metadata file should also be added to that collection.

As a simple example, I was able to make this work with the following yaml specification:

definitions:
  datasets:
    dataset:
      format: AIRR
      params:
        is_repertoire: true
        metadata_file: metadata.csv
        region_type: IMGT_JUNCTION
instructions:
  my_dataset_generation_instruction:
    datasets:
    - dataset
    export_formats:
    - ImmuneML
    type: DatasetExport
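
For a sequence dataset such as the unpaired TRB data described earlier in this thread, a variant of the specification might look like the sketch below. This is untested and rests on assumptions: that the AIRR import accepts is_repertoire: false together with paired: false to select a sequence dataset, and that no metadata file is needed in that case. The dataset and instruction names are arbitrary labels. region_type: IMGT_JUNCTION is carried over from the example above; as far as I understand it, this imports the junction as-is rather than trimming the first and last residues, but please verify this against the documentation for the immuneML version running on your Galaxy instance.

```yaml
definitions:
  datasets:
    dataset:                        # arbitrary dataset name
      format: AIRR
      params:
        is_repertoire: false        # assumption: import sequences rather than repertoires
        paired: false               # assumption: unpaired chains (sequence dataset, not receptor dataset)
        region_type: IMGT_JUNCTION  # carried over from the repertoire example
instructions:
  my_dataset_generation_instruction:
    datasets:
    - dataset
    export_formats:
    - ImmuneML
    type: DatasetExport
```

With this variant, only the AIRR sequence file(s) would be selected as input in the Create Dataset tool, since the metadata file applies to repertoire datasets.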

I hope this was helpful! I'm closing the issue for now, but feel free to reach out or send a bug report if it doesn't work out.