Open tilschaef opened 3 years ago
Okay, I made some changes and added more examples of how I think the tables will look in the different situations. Hope this is a clear overview :)
This is indeed a way other researchers often save their data (Gert Jan confirmed this is being used). The only thing people need to be cautious of is having exactly the same number of variable fields in their cell names for different samples (otherwise they can always opt for option 3).
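To make the "same number of fields" caveat concrete, here is a minimal sketch of a check we could run before splitting names into meta data columns. The function names and the `_` separator default are my own assumptions, not something that exists in the pipeline.

```python
from collections import Counter


def field_counts(cell_names, sep="_"):
    """Count how many cell names have each number of sep-separated fields."""
    return Counter(len(name.split(sep)) for name in cell_names)


def has_consistent_fields(cell_names, sep="_"):
    """True when every cell name splits into the same number of fields."""
    return len(field_counts(cell_names, sep)) <= 1


# Example names in the style used in this issue; the last one has an extra field.
names = ["GRCh38_820_d0_A1", "GRCh38_820_d0_A2", "GRCh38_821_d1_treated_A1"]
counts = field_counts(names)
print(dict(counts), has_consistent_fields(names))
```

If the check fails, the script could stop with a message pointing the user to option 3 instead of silently producing shifted columns.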
For the meta data table in .tab/.csv format: indeed, the important thing will be the matching to the cells in the database, so people should use the names present in the `kallistobus/` output folder for a `Sample` column as well, right? These will indeed have `Genome` as a first field. (The last field, the well/cell-id, will also be there, so we have to make sure the column names can still be matched.)
Also, I discussed this further in the group meeting last week, and there are 2 different layout scenarios for these meta data files:
2.1. Sample meta data

This is needed especially for droplet-based methods (but it could also be used for plate-based ones): a table could be added that contains the info per sample:
Sample | Genome | Library | Timepoint | Treatment |
---|---|---|---|---|
GRCh38_820_d0 | GRCh38 | 820 | d0 | no |
GRCh38_821_d1_treated_A1 | GRCh38 | 821 | d1 | yes |
GRCh38_821_d3 | GRCh38 | 821 | d3 | no |
And this meta data should be spread over/connected to the individual cells in the samples within our script, before it can be added to a SCE-Seurat object.
(I think this should be possible when matching on a column name with the last `_id` removed?)
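A rough sketch of that matching step, assuming the pipeline always appends the cell/well id as the last `_`-separated field. The function names and the barcodes in the example are made up for illustration; the two sample rows are copied from the table above.

```python
def strip_cell_id(cell_name, sep="_"):
    """Remove the last field, assumed to be the well/cell id added by the pipeline."""
    return cell_name.rsplit(sep, 1)[0]


def spread_sample_metadata(cell_names, sample_meta, sep="_"):
    """Map every cell name to the meta data row of its sample.

    Raises ValueError listing the cells whose sample is missing from the table.
    """
    unmatched = [c for c in cell_names if strip_cell_id(c, sep) not in sample_meta]
    if unmatched:
        raise ValueError(f"no sample meta data for: {unmatched}")
    return {c: sample_meta[strip_cell_id(c, sep)] for c in cell_names}


# Two rows of the sample table, keyed by the Sample column.
sample_meta = {
    "GRCh38_820_d0": {"Genome": "GRCh38", "Library": "820", "Timepoint": "d0", "Treatment": "no"},
    "GRCh38_821_d3": {"Genome": "GRCh38", "Library": "821", "Timepoint": "d3", "Treatment": "no"},
}
# Hypothetical cell names: sample name plus a made-up barcode as the last field.
cells = ["GRCh38_820_d0_AAACCTGA", "GRCh38_820_d0_AAACGGGT", "GRCh38_821_d3_AAACCTGA"]
cell_meta = spread_sample_metadata(cells, sample_meta)
print(cell_meta["GRCh38_821_d3_AAACCTGA"]["Timepoint"])
```

The resulting per-cell mapping is what would then be attached to the SCE/Seurat object. Failing loudly on unmatched cells seems safer than filling in missing values.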
2.2. Cell/Well meta data

This is for people working with plates in which different samples are sorted into 1 plate (often the case, to prevent the technical variation from being coupled to the samples). In this case, the user has to make sure they provide meta data for each cell of the dataset, which can then easily be matched in the script. (This is never the case for droplet-based methods, unless people have meta data per barcode from a pre-analysed dataset.)
Sample | Genome | Library | Timepoint |
---|---|---|---|
GRCh38_820_d0_A1 | GRCh38 | 820 | d0 |
GRCh38_820_d0_A2 | GRCh38 | 820 | d0 |
GRCh38_820_d0_A3 | GRCh38 | 820 | d3 |
GRCh38_820_d0_A4 | GRCh38 | 820 | d3 |
... | ... | ... | ... |
GRCh38_821_d1_treated_A1 | GRCh38 | 821 | d0 |
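For the per-cell layout, the matching could look roughly like the sketch below. It assumes a tab-separated file with a `Sample` column holding the full cell name; the function names are hypothetical, and the in-memory table stands in for the user's file.

```python
import csv
import io


def read_cell_metadata(handle, delimiter="\t", key="Sample"):
    """Read a delimited meta data table into {cell name: row dict}."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return {row[key]: row for row in reader}


def match_cells(cell_names, cell_meta):
    """Return (matched rows per cell, list of cells without a meta data row)."""
    matched = {c: cell_meta[c] for c in cell_names if c in cell_meta}
    missing = [c for c in cell_names if c not in cell_meta]
    return matched, missing


# Stand-in for a user-provided .tab file with the first two rows of the table above.
table = io.StringIO(
    "Sample\tGenome\tLibrary\tTimepoint\n"
    "GRCh38_820_d0_A1\tGRCh38\t820\td0\n"
    "GRCh38_820_d0_A2\tGRCh38\t820\td0\n"
)
cell_meta = read_cell_metadata(table)
dataset_cells = ["GRCh38_820_d0_A1", "GRCh38_820_d0_A2", "GRCh38_820_d0_A3"]
matched, missing = match_cells(dataset_cells, cell_meta)
print(missing)
```

Reporting the missing cells back to the user seems the most helpful behaviour here, since an incomplete table is the most likely mistake in this scenario.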
`_` in the column names, since we know this was added by the pipeline itself, and thereby remove the barcode and use the rest as `Library`? (Then people don't have to specify where the barcode is, because by design, we add it last.) I think that would be the easiest way to create 1 meta data column for visualizations, if people do not provide any. Also, if people find the extraction of meta fields difficult, they do not have to think about any `_` in the names, and everything will run anyway.
(Unless you want people to be able to shorten the identifier of course, but I can imagine that letting people specify the position of the `Library` in their column names will require some trial and error?)

Library | Well-id* |
---|---|
GRCh38_820_d0 | A1 |
... | ... |
GRCh38_821_d1_treated | A1 |
... | ... |
GRCh38_822_d3 | A2 |
*The well-id column is optional I think.
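The Library / Well-id table above could be derived with something like this sketch, which assumes the pipeline always appends the barcode/well id as the last `_`-separated field. The function name is hypothetical.

```python
def split_library_and_well(column_name, sep="_"):
    """Split a column name on its last separator into (library, well id)."""
    library, found, well_id = column_name.rpartition(sep)
    if not found or not library or not well_id:
        raise ValueError(f"cannot split {column_name!r} on the last {sep!r}")
    return library, well_id


columns = ["GRCh38_820_d0_A1", "GRCh38_821_d1_treated_A1", "GRCh38_822_d3_A2"]
rows = [split_library_and_well(c) for c in columns]
for library, well_id in rows:
    print(library, well_id)
```

Because only the last separator is used, any number of `_` in the library part is fine, which is why the user would not have to think about them.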
I explicitly used `_` in the names in all these examples, because I think this should not be an issue in any of these options, whereas the first option blindly trusts the correct use of those to split up the names. But this is really something the user should just pay attention to.
For now, I see 3 different scenarios for a user to provide meta-data.
`extract_meta_columns`. Possibly the most convenient way.
`sample_well(barcode)`. In this case, we could say the identity is always the first field after splitting by `_`. In case sample itself has multiple `_`, we could ask the user to specify the cell identity index, which becomes 1 + position in sample.
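As a sketch of the identity-index idea: by default take the first field, and let the user pass an index when the sample part itself contains `_`. The function name and the 0-based index are my assumptions; the "1 + position" wording above may well mean a 1-based convention, so this would need to be agreed on.

```python
def cell_identity(cell_name, index=0, sep="_"):
    """Return the field at a 0-based index after splitting the name by sep."""
    fields = cell_name.split(sep)
    if not 0 <= index < len(fields):
        raise IndexError(f"{cell_name!r} has {len(fields)} fields, no index {index}")
    return fields[index]


# Simple sample_well name: the identity is the first field.
print(cell_identity("d0_A1"))
# Sample with several "_": the user points at the field they want.
print(cell_identity("GRCh38_820_d0_A1", index=2))
```

An out-of-range index raises an error instead of returning something arbitrary, which should cut down on the trial and error when names have different lengths.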