Galaxy interface trims CDR3 residues in create dataset

agirgis3 commented 9 months ago

Hello, in attempting to generate immuneML datasets using the galaxy interface, I am unable to specify trim_leading_trailing: false in the 'simple' parameter mode. I would like more control by using my own .yaml file in the galaxy Create Dataset interface.

However, I am also unable to upload my own .yaml specification for this purpose, because it seems the data files are loaded using some temporary cache that I do not know the filepath for.

Please let me know if this query makes sense; if not, I am happy to elaborate.

LonnekeScheffer commented 9 months ago

Hi Alexander, thanks for reaching out! What type of dataset are you working with? Sequence/receptor or repertoire dataset?

agirgis3 commented 9 months ago

Hi Lonneke, I am working with sequence datasets of unpaired TRB chains, provided in AIRR format.

LonnekeScheffer commented 9 months ago

I have looked a bit deeper into this. We have documentation for this tool here: https://docs.immuneml.uio.no/latest/galaxy/galaxy_dataset.html#using-the-advanced-create-dataset-interface

I went back and forth a bit with the tool myself and I was able to get it to work with an example dataset. A "temporary cache path" message is printed at the beginning of the stdout, but this is not an error, so I believe the real issue is something else. When your Galaxy tool crashes, it's possible to send a bug report from Galaxy so we can investigate the history and error directly (described in the docs here: https://docs.immuneml.uio.no/latest/galaxy/galaxy_intro.html#viewing-errors-and-reporting-bugs-in-galaxy)

If you're editing an existing yaml file (such as the one outputted by the create dataset tool), please make sure to:

- remove the result_path parameter
- when working with a repertoire dataset, change the name of the metadata_file to metadata.csv (and then ensure your metadata file shares the same name). When working with another dataset type, remove the metadata_file parameter.
- change region_type: IMGT_CDR3 to region_type: IMGT_JUNCTION -> this will ensure the leading and trailing conserved amino acids are kept ("trim_leading_trailing" is not an existing import parameter).

When working with repertoire datasets, make sure to select the metadata file as one of the input files (in addition to the repertoire files). If you have a lot of repertoires and they're in a collection, the metadata file should also be added to that collection. As a simple example, I was able to make this work with the following yaml specification:

definitions:
  datasets:
    dataset:
      format: AIRR
      params:
        is_repertoire: true
        metadata_file: metadata.csv
        region_type: IMGT_JUNCTION
instructions:
  my_dataset_generation_instruction:
    datasets:
    - dataset
    export_formats:
    - ImmuneML
    type: DatasetExport

I hope this was helpful! I'm closing the issue for now, but feel free to reach out or send a bug report if it doesn't work out.

uio-bmi / immuneML

Galaxy interface trims CDR3 residues in create dataset #167