tilschaef / scRNA-seq

From fastq to preprocessed counttable (for in-house CELSeq2 method), with Kallisto | Bustools workflow.

Meta-data in general #15

Open tilschaef opened 3 years ago

tilschaef commented 3 years ago

For now, I see 3 different scenarios for a user to provide meta-data.

  1. Cell names contain meta-data variables. In this case, all the fields are directly extracted based on extract_meta_columns. Possibly the most convenient way.
  2. The user provides custom meta data in a delimited .tab/.csv file. In this case, we would minimally require a genome and a sample library column to match the meta-data entries with cell identifiers. The library column, or any additional columns specified, can be used for visualization and grouping in PCA, UMAP, etc.
Example:

| Genome | Library |
| --- | --- |
| GRCh38 | 820 |

  3. The user specifies no meta data. In this case, we have to group on library, since it is the only information we can infer from cell names (if scenario 1 does not apply). We could do something similar to Seurat's CreateSeuratObject function, where you can specify an identity class for each cell based on the cell name syntax. For example, in our case the cell names have the format sample_well(barcode), so we could say the identity is always the first field after splitting by _. In case sample itself contains multiple _, we could ask the user to specify the cell identity index, which becomes 1 + position in sample.
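
A minimal sketch of scenarios 1 and 3, assuming cell names are delimited by _ and that the listed field names line up one-to-one with the fields in the name. The function names `parse_cell_name` and `cell_identity`, the 0-based `field_index`, and the example column names are hypothetical; `extract_meta_columns` is only assumed here to be an ordered list of field names.

```python
def parse_cell_name(cell_name, meta_columns, delim="_"):
    """Split a cell name into named meta-data fields (scenario 1).

    meta_columns is an assumed, ordered list of field names, standing in
    for whatever extract_meta_columns holds in the real configuration.
    """
    fields = cell_name.split(delim)
    if len(fields) != len(meta_columns):
        # Names with a different number of fields cannot be mapped safely.
        raise ValueError(
            f"{cell_name!r} has {len(fields)} fields, "
            f"expected {len(meta_columns)}"
        )
    return dict(zip(meta_columns, fields))


def cell_identity(cell_name, field_index=0, delim="_"):
    """Pick one field of the cell name as identity class (scenario 3).

    field_index is 0-based in this sketch; the default is the first field.
    """
    return cell_name.split(delim)[field_index]


columns = ["genome", "library", "timepoint", "well"]  # hypothetical names
meta = parse_cell_name("GRCh38_820_d0_A1", columns)
identity = cell_identity("820_A1")
```

Raising an error on a field-count mismatch is one possible design choice; falling back to scenario 3 for such names would be another.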
Rebecza commented 3 years ago

Okay, I made some changes and added more examples of what I think the tables will look like in the different situations. Hope this is a clear overview :)

  1. This is indeed how other researchers often save their data (Gert Jan confirmed this is being used). The only thing people need to be cautious of is having exactly the same number of variable fields in their cell names across samples (otherwise they can always opt for option 3).

  2. For the meta data table in .tab/.csv format, the important thing will indeed be the matching to the cells in the database, so people should use the names present in the kallistobus/ output folder for a Sample column as well, right? These will indeed have Genome as the first field. (The last field, the well/cell-id, will also be there, so we have to make sure the column names can still be matched.)

Also I have discussed this further in the group meeting last week, and there are 2 different layout scenarios for these meta data files:

2.1. Sample meta data

This is needed especially for droplet-based methods (but could also be used for plate-based ones): a table could be added that contains the info per sample:

| Sample | Genome | Library | Timepoint | Treatment |
| --- | --- | --- | --- | --- |
| GRCh38_820_d0 | GRCh38 | 820 | d0 | no |
| GRCh38_821_d1_treated_A1 | GRCh38 | 821 | d1 | yes |
| GRCh38_821_d3 | GRCh38 | 821 | d3 | no |

This meta data should then be spread over (connected to) the individual cells of each sample within our script, before it can be added to a SCE-Seurat object. (I think this should be possible by matching on the column name with the last _id removed?)
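
A rough sketch of that matching step, assuming a tab-delimited sample table with a Sample column and cell names of the form sample_id, where the id is whatever follows the last _. The function names and the cell names below are hypothetical; only the two table rows are taken from the example above.

```python
import csv
import io

# Sample-level table, tab-delimited; rows taken from the example above.
sample_table = (
    "Sample\tGenome\tLibrary\tTimepoint\tTreatment\n"
    "GRCh38_820_d0\tGRCh38\t820\td0\tno\n"
    "GRCh38_821_d3\tGRCh38\t821\td3\tno\n"
)


def load_sample_meta(handle):
    """Read the sample table into a dict keyed by the Sample column."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Sample"]: row for row in reader}


def annotate_cells(cell_names, sample_meta, delim="_"):
    """Attach sample-level meta data to each cell.

    The sample name is taken to be everything before the last delimiter.
    Cells without a matching sample are collected instead of dropped.
    """
    annotated, unmatched = {}, []
    for cell in cell_names:
        sample, _, _cell_id = cell.rpartition(delim)
        if sample in sample_meta:
            annotated[cell] = sample_meta[sample]
        else:
            unmatched.append(cell)
    return annotated, unmatched


sample_meta = load_sample_meta(io.StringIO(sample_table))
cells = ["GRCh38_820_d0_A1", "GRCh38_821_d3_B7", "GRCh38_999_d0_A1"]
annotated, unmatched = annotate_cells(cells, sample_meta)
```

Returning the unmatched cells separately makes it easy to warn the user about samples that are missing from the table.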

2.2. Cell/Well meta data

This is for people working with plates who sort different samples into 1 plate (often done to prevent technical variation from being coupled to the samples). In this case, the user has to make sure they provide meta data for each cell of the dataset, which the script can then easily match. (This is never the case for droplet-based methods, unless people have meta data per barcode from a pre-analysed dataset.)

| Sample | Genome | Library | Timepoint |
| --- | --- | --- | --- |
| GRCh38_820_d0_A1 | GRCh38 | 820 | d0 |
| GRCh38_820_d0_A2 | GRCh38 | 820 | d0 |
| GRCh38_820_d0_A3 | GRCh38 | 820 | d3 |
| GRCh38_820_d0_A4 | GRCh38 | 820 | d3 |
| ... | ... | ... | ... |
| GRCh38_821_d1_treated_A1 | GRCh38 | 821 | d0 |

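
Since every cell needs its own row in this layout, a completeness check before matching may be useful. The sketch below assumes the meta table has already been read into a list of dicts with a Sample column; `check_cell_meta` and the example rows are hypothetical.

```python
def check_cell_meta(cell_names, meta_rows, key="Sample"):
    """Compare cell names from the count table with a per-cell meta table.

    Reports cells without a meta row, meta rows without a cell, and
    identifiers that occur more than once in the meta table.
    """
    cells = set(cell_names)
    seen, duplicated = set(), set()
    for row in meta_rows:
        cell_id = row[key]
        if cell_id in seen:
            duplicated.add(cell_id)
        seen.add(cell_id)
    return {
        "missing": sorted(cells - seen),
        "extra": sorted(seen - cells),
        "duplicated": sorted(duplicated),
    }


cells = ["GRCh38_820_d0_A1", "GRCh38_820_d0_A2", "GRCh38_820_d0_A3"]
rows = [
    {"Sample": "GRCh38_820_d0_A1", "Timepoint": "d0"},
    {"Sample": "GRCh38_820_d0_A2", "Timepoint": "d0"},
    {"Sample": "GRCh38_820_d0_A9", "Timepoint": "d3"},
]
report = check_cell_meta(cells, rows)
```
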
  3. What I was thinking here: we can split on just the last occurring _ in the column names, since we know it was added by the pipeline itself, and thereby remove the barcode and use the rest as Library? (Then people don't have to specify where the barcode is because, by design, we add it last.) I think that would be the easiest way to create 1 meta data column for visualizations if people do not provide any. Also, if people find the extraction of meta fields difficult, they do not have to think about any _ in the names, and everything will run anyway. (Unless you want people to be able to shorten the identifier of course, but I can imagine that letting people specify the position of the Library in their column names will require some trial and error?)

| Library | Well-id* |
| --- | --- |
| GRCh38_820_d0 | A1 |
| ... | ... |
| GRCh38_821_d1_treated | A1 |
| ... | ... |
| GRCh38_822_d3 | A2 |

*The well-id column is optional I think.
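
The fallback could be sketched roughly as follows, assuming the pipeline always appends the well/barcode after the last _. The function name `library_and_well` is hypothetical, and treating a name without any _ as a library with no well-id is an assumption of this sketch.

```python
from collections import Counter


def library_and_well(cell_name, delim="_"):
    """Split on the last delimiter only: (library, well_id)."""
    library, sep, well = cell_name.rpartition(delim)
    if not sep:
        # No delimiter at all: keep the whole name as library (assumption).
        return cell_name, None
    return library, well


cells = [
    "GRCh38_820_d0_A1",
    "GRCh38_821_d1_treated_A1",
    "GRCh38_822_d3_A2",
]
fallback_meta = {cell: library_and_well(cell) for cell in cells}

# Number of cells per inferred library, e.g. for a quick sanity check.
cells_per_library = Counter(lib for lib, _well in fallback_meta.values())
```

Because only the last _ is used, any further _ inside the sample name is left untouched.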

I explicitly used _ in the names in all these examples, because I think this should not be an issue in any of these options except the first, which blindly trusts correct use of _ to split up the names. But this is really something the user should just pay attention to.