pepkit / peppy

Project metadata manager for PEPs in Python
https://pep.databio.org/peppy
BSD 2-Clause "Simplified" License
37 stars 13 forks source link

sample_modifiers applied to subsamples #435

Closed johnyaku closed 1 year ago

johnyaku commented 1 year ago

When a subsample_table is merged into the sample_table the new columns resulting from the merge are list columns. This makes sense, as part of the motivation for subsample tables is to cater for situations where there a single sample has multiple values for a given property (column).

However, applying sample_modifiers such as derive breaks on these list columns.

More natural behaviour would be to iteratively apply the sample modifier rules to all elements in the list. Alternatively, perhaps single item lists could be "unlisted" to become normal columns, and thus subject to sample_modifiers.

Not sure if this a feature request or a "how to" question...

johnyaku commented 1 year ago

The motivation for wanting to be able to do this relates to issue #434.

Specifically, I want to be able to separate sample "properties" from "data paths". I was hoping that putting one or the other in a subsample table would do the trick, but the single item lists are breaking things.

If this is actually a feature request (ie, there is no easy way to do this atm) then can I propose the idea of a "merge table", as similar to a subsample table in that it gets merged into the sample_table but where the expectation is that there will be no more than one row per sample.

johnyaku commented 1 year ago

Digging deeper, it looks like that this is a feature request.

But a simple implementation (rather than introducing a merge_table item) would be to allow the specification of more than one sample_table as a list, as can now be done with subsample_table. There would need to be a common index, of course, but the tables could then be merged.

nsheff commented 1 year ago

This makes a lot of sense to me, and I like the idea. I also agree that just making sample_table accept either a single table or a list of tables would give you the machinery that you need to separate your sample tables into two. It also wouldn't be too hard to implement.

But as I think about this more, I'm not certain this is better than the way I've typically approached this in the past: which is to use a combined append/derived attribute. With this approach, I simply don't put the "data paths" into the sample table at all, I only put them into the project config file. You can do 1) use append to create a new column; and then 2) use derive to turn it into a path.

This way, the paths are solely encoded in derived attribute modifiers, and the sample table has no paths. I discuss this a little bit here:

https://pep.databio.org/en/latest/howto_eliminate_paths/

You can make it even more powerful in this example by not including the file_path in the sample table, but adding it in using append (or even using imply if it varies across the table).

johnyaku commented 1 year ago

Thanks! Your suggestion works beautifully.

For anyone stumbling on this thread, my toy example looks like this:

sample_name,slide,area
CID_1234,V11Y24-111,A1
CID_4321,V11Y24-111,B1

.. saved as samples.csv. Then the config file looks like this:

name: peptest

pep_version: 2.1.0
sample_table: "samples.csv"

sample_modifiers:
    append:
        image: "_img"
        fastq: "_fq"
    derive:
        attributes: [image,fastq]
        sources:
            _img: data/{slide}_{area}.jpg
            _fq: data/{sample_name}.fastq.gz

If I get time I'll make a PR to extend the documentation to show how to compbine appened and derive. The linked document above (which actually sent me down this line of thought) uses imply plus derive. I guess if I was more familar with the PEP specification the extension to append plus derive would have been obvious, but it might still be worth spelling it out for slow-pokes like me!

johnyaku commented 1 year ago

This is working great when the sample table is indexed by sampe_name, but at least in some cases it fails with a non-standard index.

Here is my reprex ...

First, samples.csv ...

CAPTURE_ID,SAMPLE_ID
BH_pool1,CT_1126
BH_pool1,CT_1128
BH_pool1,CT_1108
BH_pool2,CT_167
BH_pool2,CT_1134
BH_pool2,CT_174
BH_pool3,CT_1194
BH_pool3,CT_1155
BH_pool3,CT_1152
BH_pool4,CT_1135
BH_pool4,CT_1185
BH_pool4,CT_1209
BH_pool5,CT_1147
BH_pool5,CT_1178
BH_pool5,CT_1157
BH_pool6,CT_1206
BH_pool6,CT_1205
BH_pool6,CT_1227
BH_pool8,CT_1213
BH_pool7,CT_1210
BH_pool7,CT_1201
BH_pool7,CT_1200
BH_pool8,CT_1203
BH_pool8,CT_1216
BH_pool8,CT_1214

.. and my config.yaml ...

name: reprex

pep_version: 2.1.0
sample_table: samples.csv
sample_table_index: SAMPLE_ID

sample_modifiers:
  append:
    cell_barcode_file: _bc
    cell_bam_file: _bam
    genotype_cel_file: _cel
  derive:
    attributes: [cell_barcode_file,cell_bam_file,genotype_cel_file]
    sources:
      _bc  : data/counts/{CAPTURE_ID}/GE/{CAPTURE_ID}/outs/per_sample_outs/{CAPTURE_ID}/count/sample_feature_bc_matrix/barcodes.tsv.gz
      _bam : data/counts/{CAPTURE_ID}/GE/{CAPTURE_ID}/outs/per_sample_outs/{CAPTURE_ID}/count/sample_alignments.bam
      _cel : data/snp/SNP_{SAMPLE_ID}.CEL

I then try to load the sample sheet with ...

import peppy
pep=peppy.Project("config.yaml")

... and I get the following traceback:

Traceback (most recent call last):
  File "/path/to/lib/python3.10/site-packages/attmap/pathex_attmap.py", line 39, in __getattr__
    v = super(PathExAttMap, self).__getattribute__(item)
AttributeError: 'Sample' object has no attribute 'sample_name'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/path/to/lib/python3.10/site-packages/attmap/ordattmap.py", line 46, in __getitem__
    return super(OrdAttMap, self).__getitem__(item)
KeyError: 'sample_name'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/path/to/lib/python3.10/site-packages/attmap/pathex_attmap.py", line 42, in __getattr__
    return self.__getitem__(item, expand)
  File "/path/to/lib/python3.10/site-packages/attmap/pathex_attmap.py", line 59, in __getitem__
    v = super(PathExAttMap, self).__getitem__(item)
  File "/path/to/lib/python3.10/site-packages/attmap/ordattmap.py", line 48, in __getitem__
    return AttMap.__getitem__(self, item)
  File "/path/to/lib/python3.10/site-packages/attmap/attmap.py", line 32, in __getitem__
    return self.__dict__[item]
KeyError: 'sample_name'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/path/to/pep-reprex/./pep_check.py", line 20, in <module>
    pep=peppy.Project(config_file)
  File "/path/to/lib/python3.10/site-packages/peppy/project.py", line 160, in __init__
    self.create_samples(modify=False if self[SAMPLE_TABLE_FILE_KEY] else True)
  File "/path/to/lib/python3.10/site-packages/peppy/project.py", line 259, in create_samples
    self.modify_samples()
  File "/path/to/lib/python3.10/site-packages/peppy/project.py", line 438, in modify_samples
    self.attr_derive()
  File "/path/to/lib/python3.10/site-packages/peppy/project.py", line 830, in attr_derive
    "Deriving '{}' attribute for '{}'".format(attr, sample.sample_name)
  File "/path/to/lib/python3.10/site-packages/attmap/pathex_attmap.py", line 46, in __getattr__
    raise AttributeError(item)
AttributeError: sample_name

If I rename the SAMPLE_ID column in sampes.csv to sample_name and then use the derive modifier to make SAMPLE_ID from sample_name then it works, but that kind of defeats the purpose of sample_table_index, no?

nsheff commented 1 year ago

Indeed. in fact I think to use it with subsamples you will also need to specify subsample_table_index in addition to sample_table_index...

johnyaku commented 1 year ago

Thanks @nsheff

Can you please elaborate a little?

Here's an even simpler reprex for you...

Take this example. When I try to load this with peppy.Project() everything works as advertised.

Now add the following:

sample_modifiers:
  append:
    a: x
  derive:
    attributes: [a]
    sources:
      x: y/{sample_id}

Load this with peppy.Project() I get the following traceback:

Traceback (most recent call last):
  File "/path/to/lib/python3.11/site-packages/attmap/pathex_attmap.py", line 39, in __getattr__
    v = super(PathExAttMap, self).__getattribute__(item)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Sample' object has no attribute 'sample_name'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/path/to/lib/python3.11/site-packages/attmap/ordattmap.py", line 46, in __getitem__
    return super(OrdAttMap, self).__getitem__(item)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'sample_name'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/path/to/lib/python3.11/site-packages/attmap/pathex_attmap.py", line 42, in __getattr__
    return self.__getitem__(item, expand)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/lib/python3.11/site-packages/attmap/pathex_attmap.py", line 59, in __getitem__
    v = super(PathExAttMap, self).__getitem__(item)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/lib/python3.11/site-packages/attmap/ordattmap.py", line 48, in __getitem__
    return AttMap.__getitem__(self, item)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/lib/python3.11/site-packages/attmap/attmap.py", line 32, in __getitem__
    return self.__dict__[item]
           ~~~~~~~~~~~~~^^^^^^
KeyError: 'sample_name'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/to/lib/python3.11/site-packages/peppy/project.py", line 163, in __init__
    self.create_samples(modify=False if self[SAMPLE_TABLE_FILE_KEY] else True)
  File "/path/to/lib/python3.11/site-packages/peppy/project.py", line 262, in create_samples
    self.modify_samples()
  File "/path/to/lib/python3.11/site-packages/peppy/project.py", line 441, in modify_samples
    self.attr_derive()
  File "/path/to/lib/python3.11/site-packages/peppy/project.py", line 839, in attr_derive
    "Deriving '{}' attribute for '{}'".format(attr, sample.sample_name)
                                                    ^^^^^^^^^^^^^^^^^^
  File "/path/to/lib/python3.11/site-packages/attmap/pathex_attmap.py", line 46, in __getattr__
    raise AttributeError(item)
AttributeError: sample_name

Rather than sample.sample_name perhaps this should be sample.st_index? I don't understand the code base well enough yet to offer a PR, but this looks like a bug to me.

johnyaku commented 1 year ago

FWIW, ammending project_config.yaml to the following kind of "works", but is more than a little clunky.

pep_version: "2.1.0"
sample_table: sample_table.csv
sample_table_index: sample_name

sample_modifiers:
  append:
    sample_id: sid
    a: x
  derive:
    attributes: [sample_id, a]
    sources:
      sid: "{sample_name}"
      x: y/{sample_id}

For context, our motivation for not simply using sample_name for the index is that we frequently multiplex samples into a "pool" which is sampled together and then demultiplexed back into individual samples. Some of our workflows operate on pools, whereas others operate on individual samples. We would like to use the term "sample" consistently to refer to the individual biological samples, as distinct from a "pool" or "capture". Being forced to use sample_name to index a table of "pools" breaks this semantic consistency.

nsheff commented 1 year ago

Yes, you found a bug, and I have fixed it locally. Will push soon. Thanks!

johnyaku commented 1 year ago

Thanks! :)

nsheff commented 1 year ago

Try version 0.35.5 (now released) and see if this solves it for you.

johnyaku commented 1 year ago

Thanks Nathan, I have tested this on my original reprex and it solves the problem.

Thanks for fixing this so quickly!