juaml / junifer

Forschungszentrum Jülich Neuroimaging Feature Extractor
https://juaml.github.io/junifer
GNU Affero General Public License v3.0
14 stars 13 forks source link

[BUG]: subject ID truncated incorrectly in condor submit dag file #322

Closed LeSasse closed 5 months ago

LeSasse commented 5 months ago

Is there an existing issue for this?

Current Behavior

First I queued using the datagrabber without specifying the elements. The resulting file gives elements such as "0001" etc.

When I supply a txt file with a subset of elements I want to process, the subject ID's have the 0 buffers removed.

Expected Behavior

It should output the ID's as a string so the buffering 0's should still be there.

Steps To Reproduce

  1. See the environment posted below.
  2. See the YAML i used:
    
    workdir: /tmp

datagrabber: kind: DataladAOMICID1000

preprocess: kind: Smoothing using: "nilearn" "on": "BOLD" smoothing_params: fwhm: 10

storage: kind: HDF5FeatureStorage uri: data/aomic_id1000.hdf5

markers:

queue: jobname: smoothing_test kind: HTCondor collect: on_success_only env: kind: venv name: env mem: 20G verbose: 8

3. I create a txt file in the working directory (which is also where I have the yaml) as elements.txt:

"0001" "0002" "0100" "0150"

I then run `junifer queue test.yaml --element elements.txt`

I tried this with and without quotes but both result in the wrong output (snippet from the result dag file):

JOB run0 /data/project/identification_prediction/projects/test_junifer_smoothing/junifer_jobs/smoothing_test/run_smoothing_test.submit VARS run0 element="1" log_element="1"

JOB run1 /data/project/identification_prediction/projects/test_junifer_smoothing/junifer_jobs/smoothing_test/run_smoothing_test.submit VARS run1 element="2" log_element="2"

JOB run2 /data/project/identification_prediction/projects/test_junifer_smoothing/junifer_jobs/smoothing_test/run_smoothing_test.submit VARS run2 element="100" log_element="100"

JOB run3 /data/project/identification_prediction/projects/test_junifer_smoothing/junifer_jobs/smoothing_test/run_smoothing_test.submit VARS run3 element="150" log_element="150"

JOB collect /data/project/identification_prediction/projects/test_junifer_smoothing/junifer_jobs/smoothing_test/collect_smoothing_test.submit PARENT run0 run1 run2 run3 CHILD collect


I then run the `junifer queue aomic_id1000.yaml` to see what the file **should** look like, here the result looks as follows (only a snippet):

JOB run813 /data/project/identification_prediction/projects/test_junifer_smoothing/junifer_jobs/smoothing_test/run_smoothing_test.submit VARS run813 element="0329" log_element="0329"

JOB run814 /data/project/identification_prediction/projects/test_junifer_smoothing/junifer_jobs/smoothing_test/run_smoothing_test.submit VARS run814 element="0603" log_element="0603"

JOB run815 /data/project/identification_prediction/projects/test_junifer_smoothing/junifer_jobs/smoothing_test/run_smoothing_test.submit VARS run815 element="0463" log_element="0463"

JOB run816 /data/project/identification_prediction/projects/test_junifer_smoothing/junifer_jobs/smoothing_test/run_smoothing_test.submit VARS run816 element="0911" log_element="0911"

JOB run817 /data/project/identification_prediction/projects/test_junifer_smoothing/junifer_jobs/smoothing_test/run_smoothing_test.submit VARS run817 element="0791" log_element="0791"


### Environment

```markdown
❱ junifer wtf
junifer:
  version: 0.0.4.dev822
python:
  version: 3.10.10
  implementation: CPython
dependencies:
  click: 8.1.7
  numpy: 1.26.4
  scipy: 1.11.4
  datalad: 0.19.6
  pandas: 2.1.4
  nibabel: 5.2.1
  nilearn: 0.10.2
  sqlalchemy: 2.0.28
  ruamel.yaml: 0.17.21
  httpx: 0.26.0
  tqdm: 4.66.1
  templateflow: 24.2.0
  looseversion: None
system:
  platform: Linux-6.6.13+bpo-amd64-x86_64-with-glibc2.36
environment:
  LC_CTYPE: en_US.UTF-8
  PATH: /home/lsasse/miniconda3/bin:/home/lsasse/miniconda3/condabin:/home/lsasse/bin:/home/lsasse/.dotfiles/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/X11R6/bin:/usr/local/games:/usr/games

Relevant log output

No response

Anything else?

No response

fraimondo commented 5 months ago

We are using only strings for replacements for AOMIC instead of "sub-0001". I was having that problem yesterday, as the numbers were converted to ints.

I know that maybe this is not the place, but we should try to keep with the BIDS standard and use the subjects with the prefix.

I want to merge the junifer DF with the participants.tsv and I need to manipulate the index.

fraimondo commented 5 months ago

So for this bug, the parse elements should not convert from text to int. But how can we know when it's ok to do it? or when it's not?

LeSasse commented 5 months ago

the elements should always be string literals in my opinion. Maybe that requires changing some data grabber implementations? but that is in my opinion the easiest and best solution

fraimondo commented 5 months ago

the elements should always be string literals in my opinion. Maybe that requires changing some data grabber implementations? but that is in my opinion the easiest and best solution

Not at all. I think it's the element parser that should not convert to int. Indeed you can see here that some of this things is converting to int. Even though we specifically want strings.

https://github.com/juaml/junifer/blob/a9e799caa3433d3d4eaa633d4565771b8eb297f6/junifer/api/cli.py#L98-L124

LeSasse commented 5 months ago

converting to int

Yeah it should never do that in my opinion. I would always expect that the elements in the text files are seen as strings.

synchon commented 5 months ago

Following the BIDS standard and having sub- prefix in my opinion is also the best way. The way pandas.DataFrames handles data loading from CSV is kind of the issue here. Although, the code is meant to convert as much as possible to str, it misses this and we get this issue.

LeSasse commented 5 months ago

maybe https://numpy.org/doc/1.18/reference/generated/numpy.loadtxt.html is better suited in this case since you can specify a delimiter and tell it the dtype=str so then it should all be loaded as str

synchon commented 5 months ago

Here's a relevant issue: https://stackoverflow.com/questions/16988526/pandas-reading-csv-as-string-type

The pandas type sniffer is the culprit from what I understand.

LeSasse commented 5 months ago

Also I agree with the BIDS standard thing, but still think the bug is relevant since not all datasets follow the BIDS standard (e.g. HCP also uses integers only as subject index, although the bug likely wont happen there for different reasons). So i think getting as much "type safety" as possible here will be good nonetheless

LeSasse commented 5 months ago

Here's a relevant issue: https://stackoverflow.com/questions/16988526/pandas-reading-csv-as-string-type

The pandas type sniffer is the culprit from what I understand.

yes, i think np.loadtxt will be better here as far as i understand

fraimondo commented 5 months ago

We need to also be able to read elements as tuples, so basically, CSV files too

LeSasse commented 5 months ago

We need to also be able to read elements as tuples, so basically, CSV files too

np.loadtxt can do that, it has a delimiter parameter and reads in columns to have 2D arrays of a particular type as far as i know

LeSasse commented 5 months ago

might be worth testing out

synchon commented 5 months ago

Okay I tested np.loadtxt and it works correctly, will make a quick PR out of this.

synchon commented 5 months ago

Okay my bad, I overlooked the pandas version which we support, so I got it working with minimal change and using pandas. I would not want to change a lot just to not have unexpected changes before a release.

LeSasse commented 5 months ago

makes sense, if there is a way to make it work with pandas, then thats preferred i think

synchon commented 5 months ago

@LeSasse @fraimondo can you run your YAMLs with this branch and see if it works as expected?

github-actions[bot] commented 5 months ago

PR Preview Action v1.4.7 :---: Preview removed because the pull request was closed. 2024-04-05 14:44 UTC

LeSasse commented 5 months ago

@LeSasse @fraimondo can you run your YAMLs with this branch and see if it works as expected?

works for me now

codecov[bot] commented 5 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 100.00%. Comparing base (3a36ad0) to head (c284d76).

Additional details and impacted files [![Impacted file tree graph](https://app.codecov.io/gh/juaml/junifer/pull/322/graphs/tree.svg?width=650&height=150&src=pr&token=5H21JuZXMw&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=juaml)](https://app.codecov.io/gh/juaml/junifer/pull/322?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=juaml) ```diff @@ Coverage Diff @@ ## main #322 +/- ## ========================================= Coverage 100.00% 100.00% ========================================= Files 1 1 Lines 1 1 ========================================= Hits 1 1 ``` | [Flag](https://app.codecov.io/gh/juaml/junifer/pull/322/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=juaml) | Coverage Δ | | |---|---|---| | [docs](https://app.codecov.io/gh/juaml/junifer/pull/322/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=juaml) | `100.00% <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=juaml#carryforward-flags-in-the-pull-request-comment) to find out more.