ucscXena / xena-GDC-ETL

Extract, transform and load GDC data onto UCSC Xena
Apache License 2.0

Verify and then remove _EVENT and _TIME_TO_EVENT in the survival data files #59

Closed: maryjgoldman closed this issue 5 years ago

maryjgoldman commented 5 years ago

To calculate survival, you need two columns. Right now there are two pairs of columns in the survival data files:

  1. _EVENT and _TIME_TO_EVENT
  2. _OS and _OS_IND

_EVENT and _TIME_TO_EVENT should be identical to _OS_IND and _OS, respectively (i.e. _EVENT = _OS_IND and _TIME_TO_EVENT = _OS). When we loaded the old GDC data, we were deprecating _EVENT and _TIME_TO_EVENT in favor of the more precise names _OS and _OS_IND. However, we kept _EVENT and _TIME_TO_EVENT to stay backward compatible with older Xena Browser releases. Enough time has passed that we no longer need that backward compatibility.

To do:

  1. Check that _EVENT = _OS_IND and _TIME_TO_EVENT = _OS, either via the code or via a check on the data itself
  2. Remove _EVENT and _TIME_TO_EVENT from all survival data files

Close this issue when these changes are on the hub, ready for QA.
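
The two to-do items can be sketched directly against the data. This is a minimal illustration, not the project's actual code: it assumes a tab-separated survival file with a `sample` ID column, and the function name `drop_legacy_columns` and the sample rows are made up for the example.

```python
import csv
import io

# Hypothetical survival table; the sample IDs and values are invented.
SURVIVAL_TSV = (
    "sample\t_EVENT\t_TIME_TO_EVENT\t_OS\t_OS_IND\n"
    "S1\t1\t120\t120\t1\n"
    "S2\t0\t340\t340\t0\n"
)

# Each legacy column and the current column it should duplicate.
LEGACY_TO_CURRENT = {"_EVENT": "_OS_IND", "_TIME_TO_EVENT": "_OS"}


def drop_legacy_columns(tsv_text):
    """Verify the legacy columns match the current ones, then drop them."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)

    # To-do 1: every row must agree. Values are compared as raw text,
    # so "120" and "120.0" would count as a mismatch.
    for row in rows:
        for legacy, current in LEGACY_TO_CURRENT.items():
            if row[legacy] != row[current]:
                raise ValueError(
                    f"{row['sample']}: {legacy}={row[legacy]!r} "
                    f"differs from {current}={row[current]!r}"
                )

    # To-do 2: write the table back without the legacy columns.
    kept = [name for name in reader.fieldnames if name not in LEGACY_TO_CURRENT]
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=kept, delimiter="\t",
        extrasaction="ignore", lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


cleaned = drop_legacy_columns(SURVIVAL_TSV)
```

Raising on the first mismatch keeps a silent data difference from being dropped along with the legacy columns.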

yunhailuo commented 5 years ago

  1. Check that _EVENT = _OS_IND and _TIME_TO_EVENT = _OS, either via the code or via a check on the data itself

Yes. https://github.com/yunhailuo/xena-GDC-ETL/blob/master/xena_gdc_etl/xena_dataset.py#L1821-L1823

  2. Remove _EVENT and _TIME_TO_EVENT from all survival data files

@ayan-b I think replacing these two lines with a rename should be enough: https://github.com/yunhailuo/xena-GDC-ETL/blob/master/xena_gdc_etl/xena_dataset.py#L1816-L1817 (Sorry, my earlier link, https://github.com/yunhailuo/xena-GDC-ETL/blob/master/xena_gdc_etl/xena_dataset.py#L1822-L1823, pointed at the wrong lines.) You will probably also need to keep the map(int) below.

maryjgoldman commented 5 years ago

There is a problem with the Xena Browser around this. Not sure why, but the browser is not recognizing the columns. Running this by Brian and Jing to see what we should do. We may need to revert if we don't have the engineering time to fix the Xena Browser code ... :( :(

maryjgoldman commented 5 years ago

So, Jing figured it out. The names of the fields are wrong and need to be renamed:

  1. _OS -> OS.time
  2. _OS_IND -> OS

Please rename and reload. You can do just one cohort if you want or, if it's easy, do all of them.
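
The requested rename only touches the header row. A minimal sketch, assuming a tab-separated file whose first line is the header; the function name `rename_survival_header` and the sample row are hypothetical:

```python
# Field names the Xena Browser expects, per the comment above.
RENAMES = {"_OS": "OS.time", "_OS_IND": "OS"}


def rename_survival_header(tsv_text):
    """Rename header fields and leave every data row untouched."""
    header, newline, body = tsv_text.partition("\n")
    new_header = "\t".join(
        RENAMES.get(name, name) for name in header.split("\t")
    )
    return new_header + newline + body


# Hypothetical one-row table for illustration.
renamed = rename_survival_header("sample\t_OS\t_OS_IND\nS1\t120\t1\n")
```

Matching whole field names, rather than doing a text replace on the header line, avoids turning `_OS_IND` into `OS.time_IND`.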

ayan-b commented 5 years ago

@maryjgoldman Updated Survival data for all the cohorts.

maryjgoldman commented 5 years ago

As far as I can tell this looks good. However, I will not be able to finish my QA until #63, which covers the extra samples that do not have any genomic data (the -Z samples), is done.

maryjgoldman commented 5 years ago

Removed the -01Z samples manually and finished the QA. This looks good.
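
For reference, the manual removal could be scripted roughly as follows. This is only a sketch under stated assumptions: it assumes the sample ID is the first tab-separated column and that matching the `-01Z` suffix is enough to identify the extra samples. The function name `drop_01z_samples` and the sample IDs are invented for the example.

```python
def drop_01z_samples(tsv_text, suffix="-01Z"):
    """Drop data rows whose sample ID (first column) ends with the suffix."""
    lines = tsv_text.splitlines()
    header, data = lines[0], lines[1:]
    kept = [row for row in data if not row.split("\t")[0].endswith(suffix)]
    return "\n".join([header] + kept) + "\n"


# Hypothetical rows; real sample IDs will look different.
filtered = drop_01z_samples(
    "sample\tOS.time\tOS\n"
    "CASE-0001-01A\t120\t1\n"
    "CASE-0002-01Z\t340\t0\n"
)
```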