Open bswhite opened 4 years ago
Thanks, @bswhite! I'll take a closer look.
And yes, I recognize that all the scripts I keep asking about apparently live in the repo for the package I created...
As far as I can tell, sex is embedded in the characteristics GEO field -- though I don't see that this is required, and the way it is written doesn't appear to be standardized. That said, from a few examples, it seems to take forms like:
sex: m
sex: Female
gender: Male
gender: f
i.e., it can use either "sex: " or "gender: ", use "m"/"f" or "male"/"female", and be capitalized or not.
Table 1 of this publication lists a bunch of GEO datasets with male/female annotated: https://link.springer.com/article/10.1007/s00204-015-1632-4#Tab1 This list may be biased in the way that sex is specified. But it may also give alternate ways to specify sex.
I have included 3 examples from this table, each of which was generated by a command line like:
Rscript ./get-geo-annotations.R --gse=GSE19188 > GSE19188-metadata.tsv
Evidently, I can't attach tsv's here. Blah.
I suggest we just grep/pattern match for these common cases -- we don't have to catch all datasets. Let's just catch the common cases.
Here are a few examples:
$ more GSE19188-metadata.tsv | cut -f2 | head -3
characteristics_ch1
tissue type: tumor;cell type: LCC;overall survival: 12.5;status: deceased;gender: M
tissue type: healthy;cell type: healthy;overall survival: Not available;status: Not available;gender: Not available

$ more GSE14814-metadata.tsv | cut -f2 | head -3
characteristics_ch1
tissue: primary lung cancer;Post Surgical Treatment: OBS;Stage: II;age: 44.9;Sex: Female;Cause of death: Alive;Histology type: ADC;OS time: 8.52;OS status: Alive;DSS time: 8.52;DSS status: Alive;predominant subtype: Acinar
tissue: primary lung cancer;Post Surgical Treatment: OBS;Stage: I;age: 53.4;Sex: Male;Cause of death: Alive;Histology type: SQCC;OS time: 9.03;OS status: Alive;DSS time: 9.03;DSS status: Alive;predominant subtype: not applicable

$ more GSE33113-metadata.tsv | cut -f2 | head -3
characteristics_ch1
disease status: AJCC stage II CRC;tissue: primary tumor resection;age at diagnosis: 41,6;Sex: m;meta or recurrence within 3 years: no;time to meta or recurrence: 2000
disease status: AJCC stage II CRC;tissue: primary tumor resection;age at diagnosis: 66,06;Sex: m;meta or recurrence within 3 years: no;time to meta or recurrence: 140
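A minimal sketch of the pattern matching suggested above, written in Python rather than as part of get-geo-annotations.R. It assumes fields in characteristics_ch1 are separated by ";" as in the examples shown, and that only the "sex"/"gender" keys and the m/f/male/female values need to be caught. The names SEX_FIELD, NORMALIZED and extract_sex are made up for illustration; anything else (e.g. "Not available") comes back as None.

```python
import re

# Matches a "sex: ..." or "gender: ..." field, case-insensitively.
# Assumption: fields are ';'-separated, as in the examples in this thread.
SEX_FIELD = re.compile(r"(?:^|;)\s*(?:sex|gender)\s*:\s*([^;]*)", re.IGNORECASE)

# Only the common spellings are handled; everything else maps to None.
NORMALIZED = {"m": "male", "male": "male", "f": "female", "female": "female"}


def extract_sex(characteristics):
    """Return 'male', 'female', or None for one characteristics_ch1 string."""
    match = SEX_FIELD.search(characteristics)
    if match is None:
        return None
    value = match.group(1).strip().lower()
    return NORMALIZED.get(value)


if __name__ == "__main__":
    row = "tissue type: tumor;cell type: LCC;overall survival: 12.5;status: deceased;gender: M"
    print(extract_sex(row))
```

Because the key has to sit at the start of the string or right after a ";", a key that merely contains "sex" (say, a hypothetical "donor sex: f") is deliberately not matched. That keeps this to the common cases only.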
The get-geo-annotations.R script (https://github.com/Sage-Bionetworks/syndccutils/blob/master/R/scripts/get-geo-annotations.R) extracts links to BioSample on a per-sample/per-file basis. E.g., the following is returned by that script for GSE109089:
geo_accession  relation
GSM2931519  BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN08354877; SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3554420
GSM2931520  BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN08354876; SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3554421
...
Now, can we backtrack and get the BioSample "dataset" associated with SAMN08354877 and SAMN08354876?
Nothing comes up when I google "SAMN08354876"
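One possible way to go from the relation column to a BioSample record, sketched in Python. It pulls the accession out of the BioSample link and builds an NCBI E-utilities efetch URL for it. Assumptions: that efetch with db=biosample accepts the SAMN accession directly as the id (not verified in this thread), and that the relation string always looks like the GSE109089 output above. The function names are made up for illustration, and nothing is fetched here.

```python
import re
from urllib.parse import urlencode

# NCBI E-utilities efetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Captures the accession at the end of a ".../biosample/<accession>" link.
BIOSAMPLE_LINK = re.compile(
    r"BioSample:\s*https?://www\.ncbi\.nlm\.nih\.gov/biosample/(SAM[A-Z]+\d+)"
)


def biosample_accession(relation):
    """Return the BioSample accession in a relation string, or None."""
    match = BIOSAMPLE_LINK.search(relation)
    return match.group(1) if match else None


def biosample_efetch_url(accession):
    """Build an efetch URL for one accession (assumes accessions are valid ids)."""
    query = urlencode({"db": "biosample", "id": accession, "retmode": "xml"})
    return EFETCH_URL + "?" + query


if __name__ == "__main__":
    relation = (
        "BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN08354877; "
        "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX3554420"
    )
    print(biosample_efetch_url(biosample_accession(relation)))
```

If that assumption holds, the XML behind the printed URL should describe the sample behind SAMN08354877, and whether it includes a sex attribute would still need checking per dataset.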