Closed bussec closed 3 years ago
Funny, I noticed this when I looked at your data... I saw one PCR Target and noticed there were multiple chains on the sequences page.
Which format are you using to load the Repertoire metadata? AIRR Repertoire or the iReceptor TSV format?
@bussec you are correct, we do not actually support this at the moment. In fact, the iReceptor Turnkey uses a flattened, de-normalized data model, and currently we are not able to load all AIRR Repertoire JSON files. We currently only load AIRR Repertoire files in which each Repertoire has a single SampleProcessing and a single DataProcessing (as well as a single PCR Target).
We do this intentionally, so that each Repertoire typically contains a very specific set of annotations (a single locus, cell subset, PCR target, etc.), which makes it easy for researchers to find and isolate data of a specific type. We essentially break an experiment down into its lowest Repertoire-level constituent parts when it is loaded into the database. I suppose this also means we expect you, the data curator, to break down your experiment in this way as well.
Note that it is still possible to group SampleProcessing and DataProcessing objects together to represent your experimental methodology as you see fit by using repertoire_id, sample_processing_id and data_processing_id. However, on data loading we currently require that you use a separate Repertoire for each of these.
Note that this methodology is inherent in the iReceptor TSV format for loading Repertoire metadata, which is the format we use internally to curate our Repertoire metadata. I see that this is not well documented for our AIRR Repertoire data loading. Sorry about that...
Also, ideally, our AIRR Repertoire data loader would be able to load ANY AIRR Repertoire file and transform that into the format that we require, but currently that is not implemented.
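As an illustration only, such a transform could look something like the sketch below. This is not part of the current loader; the function name and the plain-dict representation of the Repertoire schema are assumptions for the example.

```python
import copy

def flatten_repertoire(rep):
    """Hypothetical sketch: expand one AIRR Repertoire (as a plain dict)
    with a multi-element pcr_target array into one Repertoire per target,
    each carrying a single-element pcr_target array."""
    flattened = []
    for sample in rep.get("sample", []):
        for target in sample.get("pcr_target", []):
            flat = copy.deepcopy(rep)
            flat_sample = copy.deepcopy(sample)
            flat_sample["pcr_target"] = [target]  # single PCRTarget per Repertoire
            flat["sample"] = [flat_sample]        # single SampleProcessing per Repertoire
            flattened.append(flat)
    return flattened
```

A real implementation would also need to decide how to assign the per-Repertoire identifiers, which the sketch leaves untouched.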
It looks to me like your data has IGH, IGK, and IGL. The suggested way to load this data currently would be to load it as three different Repertoires, each with the same repertoire_id but with a different sample_processing_id, one for each locus. In that way, they are still grouped as a single Repertoire. Alternatively, if it is more experimentally accurate, you could group them with the same repertoire_id and sample_processing_id and then consider the splitting of the IGH/IGK/IGL data into three different groups as a DataProcessing step, using data_processing_id to separate them...
I also note that in looking at the AIRR loading code, we have the following comment:
# In the general case, iReceptor only supports a single instance in
# array subtypes. If this occurs, we print a warning and use the first
# element in the array and ignore the rest. This is a fairly substantial
# issue and MAYBE it should be a FATAL ERROR???
I think that the comment is correct - looking at what happened when you loaded data, this was a simple "Warning", but a significant and important part of your data was not loaded. This is a bug and should be fixed - when this happens, the loader should refuse to do any data loading at all...
@bussec the extension above has been implemented on master branch for the dataloading-mongo https://github.com/sfu-ireceptor/dataloading-mongo/commit/d8d350563fd2121a93f64482256879c7d9b64e4b
For your Turnkey, you might want to consider running our "production-v4/turnkey-v4" branch. This is our stable pre-release with all of our new features (but not yet released). It is essentially what we are running on all of our COVID-19 repositories, and it has all of our new features, such as the Stats API.
We are working on preparing this now, so it isn't quite ready, but you might want to consider it... It is a bit bleeding edge, but we expect it to be stable (since we will run it on our COVID-19 repositories which are in production).
If you want to be on the bleeding edge, we can show you how to run different branches in the containers. I am not sure how this would be done in your non-docker-compose setup.
Which format are you using to load the Repertoire metadata? AIRR Repertoire or the iReceptor TSV format?
AIRR Repertoire, so the respective section in the YAML is:
Repertoire:
  [...]
  sample:
    [...]
    pcr_target:
      - pcr_target_locus: IGH
        forward_pcr_primer_target_location: null
        reverse_pcr_primer_target_location: null
      - pcr_target_locus: IGK
        forward_pcr_primer_target_location: null
        reverse_pcr_primer_target_location: null
      - pcr_target_locus: IGL
        forward_pcr_primer_target_location: null
        reverse_pcr_primer_target_location: null
The suggested way to load this data [...]
As most of the experimental steps for the different loci run side-by-side in all single-cell protocols I am aware of, I would not like to split at the sample_processing level (*). Therefore, using DataProcessing for splitting seems to be better, although this also pretends that there is a step of sorting sequences based on their locus of origin during the primary data processing (while in reality the primary pipeline does not care about this and the sorting is performed during export).
We will try to implement this as an intermediate solution, but ultimately we need better solutions for this on the level of the Common Exchange Format. Right now we might create YAMLs with more Repertoire entries than actual cells in them :thinking:
(*) To be exact: our own protocol *does* produce locus-specific libraries in the end, but this is not the case for RACE-based platforms like 10X.
It looks to me like your data has IGH, IGK, and IGL. The suggested way to load this data currently would be to load it as three different Repertoires, each with the same repertoire_id but with a different sample_processing_id, one for each locus. In that way, they are still grouped as a single Repertoire. Alternatively, if it is more experimentally accurate, you could group them with the same repertoire_id and sample_processing_id and then consider the splitting of the IGH/IGK/IGL data into three different groups as a DataProcessing step, using data_processing_id to separate them...
But neither is an exact record of the provenance of what happened? I am thinking about 10X specifically, though I am curious whether @bussec has a different protocol. In the first, the metadata suggests that there are three biological replicates that were sequenced, but 10X has a single primer mix with all the loci. And the latter is confusing, as it changes the semantics of DataProcessing in an incompatible way, so I don't think this should be done.
We will try to implement this as an intermediate solution, but ultimately we need better solutions for this on the level of the Common Exchange Format. Right now we might create YAMLs with more Repertoire entries than actual cells in them 🤔
Maybe it is possible to embed the JSON in the iReceptor field, so the structure isn't lost?
In the first, the metadata suggests that there are three biological replicates that were sequenced
Not sure I quite follow - why does having this represented as multiple SampleProcessing objects imply biological replicates? If all fields are the same except the PCRTarget object, this doesn't imply a different biological sample (which is my understanding of biological replicate)? That is the role of the sample_id.
And the latter is confusing, as it changes the semantics of DataProcessing in an incompatible way, so I don't think this should be done.
Does it change the semantics? In fact, I thought the semantics of DataProcessing were quite wide and vague (at least for now 8-) If the SampleProcessing is performed to produce a single file with a bunch of sequences, this then moves to data processing. The sequences are then annotated, and finally the file is split into three different files based on the locus for sharing; that seems like a perfectly logical use for the DataProcessing object. We do this all the time as part of our curation process - we want our data groupings to be as "atomic" as possible in terms of the type of data they store...
We will try to implement this as an intermediate solution, but ultimately we need better solutions for this on the level of the Common Exchange Format. Right now we might create YAMLs with more Repertoire entries than actual cells in them 🤔 Maybe it is possible to embed the JSON in the iReceptor field, so the structure isn't lost?
Yes, we are working on this - as this is required for us to do Single Cell, Germline, etc. And it would solve our PCRTarget problem in this case so that @bussec doesn't have to change his representation.
With that said, I am not sure that what I am suggesting above is necessarily incorrect in terms of representation of the data. Different from the exact structure that @bussec originally used, true, but I don't think it is wrong (I am sure you will correct me if it is wrong 8-)...
In the first, the metadata suggests that there are three biological replicates that were sequenced
Not sure I quite follow - why does having this represented as multiple SampleProcessing objects imply biological replicates? If all fields are the same except the PCRTarget object, this doesn't imply a different biological sample (which is my understanding of biological replicate)? That is the role of the sample_id.
Sorry, I meant technical replicates... But I said "suggests" rather than something stronger, because it does seem to work. It's ambiguous why the data is split, but at least all of the sequences are under one repertoire and data processing.
And the latter is confusing, as it changes the semantics of DataProcessing in an incompatible way, so I don't think this should be done.
Does it change the semantics? In fact, I thought the semantics of DataProcessing were quite wide and vague (at least for now 8-) If the SampleProcessing is performed to produce a single file with a bunch of sequences, this then moves to data processing. The sequences are then annotated, and finally the file is split into three different files based on the locus for sharing; that seems like a perfectly logical use for the DataProcessing object. We do this all the time as part of our curation process - we want our data groupings to be as "atomic" as possible in terms of the type of data they store...
Yes, we explicitly say not to do this in the paper and doc.
The suggested way to load this data currently would be to load this as three different Repertoires, each with the same repertoire_id but with different sample_processing_id, one for each locus. In that way, they are still grouped as a single Repertoire.
Thanks, @Brian, we have provisionally implemented your solution!
I just wanted to comment that this required us to split the TSV files by locus as well, so each rearrangement file became three in order to have a unique IG target (multiple rearrangement files cannot point to the same repertoire).
Once we did that, pcr_target was no longer an array, so the airr.load_repertoire function raised an error for not having an array. I thought this was noteworthy, since the Turnkey then expects an array with a single value.
Also, I think it is not yet implemented, but collection_time_point_relative should be an integer with an ontology as a unit.
@franasa great...
I just wanted to comment that this required us to split the TSV files by locus as well, so each rearrangement file became three in order to have a unique IG target (multiple rearrangement files cannot point to the same repertoire).
Yes, that is what I would expect... Note that from the perspective of a user of your data, this makes the data easier to use. For example, if someone is interested in only IGH, without this step it is not possible for them to get just the IGH data; they will always get a repertoire with mixed loci. We prefer to split things up in this way, as it is logically easy to reconstruct and download the mixed-locus data for a single sample, but hard to split it apart if you are only interested in one locus.
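For reference, splitting an AIRR rearrangement TSV by locus can be sketched as below, assuming the file has the standard locus column; the output naming scheme and function name are just examples, not part of any iReceptor tool.

```python
import csv

def split_by_locus(in_path):
    """Write one TSV per locus value found in an AIRR rearrangement TSV.
    Returns the sorted list of loci encountered."""
    writers, files = {}, {}
    with open(in_path, newline="") as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        for row in reader:
            locus = row["locus"]
            if locus not in writers:
                # Illustrative naming: <input>.<locus>.tsv
                out = open(f"{in_path}.{locus}.tsv", "w", newline="")
                files[locus] = out
                w = csv.DictWriter(out, fieldnames=reader.fieldnames, delimiter="\t")
                w.writeheader()
                writers[locus] = w
            writers[locus].writerow(row)
    for out in files.values():
        out.close()
    return sorted(files)
```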
Once we did that, pcr_target was no longer an array, so the airr.load_repertoire function raised an error for not having an array. I thought this was noteworthy, since the Turnkey then expects an array with a single value.
The data loader does expect the type of each field to be correct, so it would enforce an array, and in this case an array of a single element.
Also, I think it is not yet implemented, but collection_time_point_relative should be an integer with an ontology as a unit.
Yes, you are correct, the current iReceptor Turnkey repository still primarily uses AIRR v1.3, with added v1.4 fields. It does not implement any changes that would break backwards compatibility, and changing a field from a string to an ontology breaks compatibility. When v1.4 is finalized, we will make that change. Our Turnkey v4.0, which is currently in beta, will have these changes in it soon.
@franasa is it OK to close this issue?
with @bussec's blessing, yes!
Wouldn't want it any other way 8-)
pcr_target is defined as an array, which is necessary in situations in which multiple loci are amplified in parallel (e.g., single-cell experiments). The current data loader does not support this and creates the following warning: