UniformData.py cannot import name 'run_find_by_type'

gibberwocky commented 3 hours ago

ImportError: cannot import name 'run_find_by_type' from 'seqspec.seqspec_find' (/uoa/scratch/users/s14dw4/.conda/envs/cellatlas_fork/lib/python3.7/site-packages/seqspec/seqspec_find.py)

The latest version of seqspec does not have a run_find_by_type() function in seqspec_find.py. It does have a run_find() function, which accepts the same arguments plus idtype and calls the relevant find_by_xxx function based on idtype. Comparing the two seqspec_find.py files, Claude suggests that run_find_by_type() has been replaced by find_by_region_type(), and that what used to be run_find() is now find_by_region_id(). This implies that we should either call find_by_region_type() directly or call run_find(..., idtype="region-type").
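For code that has to work across library versions where a function was renamed, one option is to look the function up by name and fall back to the newer name. The helper below is only a sketch: import_first_available is a hypothetical name, and the seqspec function names in the trailing comment are taken from this thread rather than verified against the library. It is demonstrated on the standard library so that it runs without seqspec installed.

```python
import importlib


def import_first_available(module_name, candidate_names):
    """Return (name, attribute) for the first candidate the module defines."""
    module = importlib.import_module(module_name)
    for name in candidate_names:
        if hasattr(module, name):
            return name, getattr(module, name)
    raise ImportError(
        f"none of {candidate_names!r} found in module {module_name!r}"
    )


# Demonstrated on the standard library: "loads_fast" does not exist in json,
# so the helper falls back to "loads".
name, func = import_first_available("json", ["loads_fast", "loads"])
print(name)  # loads

# Hypothetical use for this issue (names from the thread, not verified):
# name, find = import_first_available(
#     "seqspec.seqspec_find", ["run_find_by_type", "find_by_region_type"]
# )
```

Note that a fallback like this only helps if the two functions take the same arguments and return the same kind of value, which would need checking against both versions.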

gibberwocky commented 2 hours ago

run_find() also requires the argument o, which appears to be the path of an output file for the YAML:

    # post processing
    if o:
        with open(o, "w") as f:
            yaml.dump(found, f, sort_keys=False)
    else:
        print(yaml.dump(found, sort_keys=False))

gibberwocky commented 2 hours ago

The first argument to run_find() needs to be self.seqspec_fn rather than self.seqspec, because run_find() itself executes spec = load_spec(spec_fn). Class UniformData has already loaded the spec before it calls run_find(), so the file is loaded twice. There's a bit of redundancy there, but it's the path of least resistance.

Implementing the above changes results in the following error:

Traceback (most recent call last):
  File "/uoa/scratch/users/s14dw4/.conda/envs/cellatlas_fork/bin/cellatlas", line 33, in <module>
    sys.exit(load_entry_point('cellatlas', 'console_scripts', 'cellatlas')())
  File "/uoa/home/s14dw4/repos/cellatlas/cellatlas/main.py", line 45, in main
    COMMAND_TO_FUNCTION[sys.argv[1]](parser, args)
  File "/uoa/home/s14dw4/repos/cellatlas/cellatlas/cellatlas_build.py", line 106, in validate_build_args
    outputs[0],
  File "/uoa/home/s14dw4/repos/cellatlas/cellatlas/UniformData.py", line 49, in __init__
    relevant_fqs = [rgn.parent_id for rgn in rgns]
TypeError: 'NoneType' object is not iterable

This indicates that the result of:

rgns = run_find(
    self.seqspec_fn,
    self.modality,
    MOD2FEATURE.get(self.modality.upper(), ""),
    idtype="region-type",
    o="",
)

is None rather than a list of regions.
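The traceback above reports a NoneType, which is what Python hands back from any function that prints or writes its result without a return statement. The sketch below reproduces that pattern with a hypothetical stand-in, dump_only, modelled loosely on the post-processing snippet quoted earlier; it is not seqspec code and does not claim to show how run_find() is actually implemented.

```python
def dump_only(found, o=""):
    """Hypothetical stand-in: writes or prints its result, returns nothing."""
    text = "\n".join(str(item) for item in found)
    if o:
        with open(o, "w") as f:
            f.write(text)
    else:
        print(text)
    # No return statement, so the caller receives None.


rgns = dump_only(["cDNA"])
print(rgns is None)  # True

# Iterating over None raises the same TypeError seen in the traceback.
try:
    relevant = [r for r in rgns]
except TypeError as exc:
    error_message = str(exc)
print(error_message)  # 'NoneType' object is not iterable
```

If run_find() behaves like this, its output is intended for the terminal or a file, and a caller that needs the region objects would have to use a function that returns them.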

gibberwocky commented 2 hours ago

run_find() accepts four idtype values: region-type, region, read, and file. Only region produces output from yaml.dump (i.e. the result is not None):

- !Region
  region_id: cDNA
  region_type: cdna
  name: cDNA
  sequence_type: random
  sequence: X
  min_len: 1
  max_len: 150
  onlist: null
  regions: null
  parent_id: R1.fastq.gz

However, the resulting rgns list is empty.

gibberwocky commented 2 hours ago

Changing from run_find() to directly calling find_by_region_id() overcomes this issue:

rgns = find_by_region_id(
    self.seqspec, self.modality, MOD2FEATURE.get(self.modality.upper(), "")
)

This leads to the next error:

TypeError: run_index() missing 4 required positional arguments: 'idtype', 'rev', 'subregion_type', and 'o'

This relates to changes to run_index() in seqspec_index.py, which now requires more parameters:

def run_index(
    spec_fn,
    modality,
    ids,
    idtype,
    fmt,
    rev,
    subregion_type,
    o,
):

The existing call in UniformData.py passes fewer:

self.x_string = run_index(self.seqspec, self.modality, rids_in_spec, fmt="kb")
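To see exactly which parameters a call leaves unfilled, the signature can be checked with the standard inspect module before anything is run. In the sketch below, run_index is a stub that only copies the parameter list quoted above (its body is a placeholder), missing_arguments is a hypothetical helper, and the argument values are made up for illustration.

```python
import inspect


def run_index(spec_fn, modality, ids, idtype, fmt, rev, subregion_type, o):
    """Stub with the parameter list quoted in the thread; body is a placeholder."""
    return None


def missing_arguments(func, *args, **kwargs):
    """List required parameters that the given call would leave unfilled."""
    sig = inspect.signature(func)
    bound = sig.bind_partial(*args, **kwargs)
    return [
        name
        for name, param in sig.parameters.items()
        if name not in bound.arguments
        and param.default is inspect.Parameter.empty
    ]


# Same shape as the old call: three positional arguments plus fmt="kb".
missing = missing_arguments(
    run_index, "spec.yaml", "rna", ["R1.fastq.gz"], fmt="kb"
)
print(missing)  # ['idtype', 'rev', 'subregion_type', 'o']
```

This matches the four names in the TypeError. Which values to pass for them depends on the seqspec version in use and is not settled in this thread.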

gibberwocky / cellatlas

UniformData.py cannot import name 'run_find_by_type' #1