ucscXena / xena-GDC-ETL

Extract, transform and load GDC data onto UCSC Xena
Apache License 2.0

Remove disease_type.project from all phenotype files #64

Closed maryjgoldman closed 5 years ago

maryjgoldman commented 5 years ago

I think we said to do this at some point but I'm still seeing this field in the files. It is a list and we don't want that. And the data in this field is already contained in the 'disease_type' field that we get from the xml files.

yunhailuo commented 5 years ago

Maybe we need more cleanup on cases.project. We have the following fields:

| Fields | Possible values |
| --- | --- |
| project.dbgap_accession_number | Always null for TCGA; but can be [phs000467, phs000465, phs000471, phs000468, phs000470, phs000466] for TARGET; useful? |
| project.disease_type | A list |
| project.name | Uterine Corpus Endometrial Carcinoma... |
| project.primary_site | A list |
| project.project_id | TCGA-UCEC... |
| project.released | Always true? |
| project.state | Always open? |

ayan-b commented 5 years ago

Only removing disease_type.project for the time being https://github.com/yunhailuo/xena-GDC-ETL/pull/68.
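
For illustration only, dropping such a column could look roughly like the sketch below. It is not the code from the pull request: it assumes phenotype rows are plain dicts, and the helper name `drop_fields` and the sample rows are hypothetical.

```python
def drop_fields(rows, fields):
    """Return copies of the rows without the given keys (hypothetical helper)."""
    unwanted = set(fields)
    return [
        {key: value for key, value in row.items() if key not in unwanted}
        for row in rows
    ]


# Made-up sample rows; real phenotype files have many more columns.
rows = [
    {
        "submitter_id": "sample-1",
        "disease_type": "Example disease",
        "disease_type.project": ["Example disease"],
    },
    {
        "submitter_id": "sample-2",
        "disease_type": "Example disease",
        "disease_type.project": ["Example disease"],
    },
]

cleaned = drop_fields(rows, ["disease_type.project"])
```

The input rows are left unmodified, so the list-valued column can still be inspected before the cleaned copy is written out.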

maryjgoldman commented 5 years ago

Good call @ayan-b to stick with the known until Yunhai and I come to a conclusion.

As for these other fields, it sounds like we need to remove them. Should we open a new GitHub ticket, or do it here? I'm fine with either. Whatever you prefer @ayan-b

ayan-b commented 5 years ago

Let's do it here.

ayan-b commented 5 years ago

@maryjgoldman Updated data in the hub.

maryjgoldman commented 5 years ago

@yunhailuo I am confused. I am looking at your list of fields above that you asked @ayan-b to take out. Many of them are still in there but when I look at them, they are not lists. Similarly, in the old GDC data they appear to not be lists too. Can you please give an example where they are a list?

Fields to investigate: project.disease_type project.name project.primary_site project.project_id

ayan-b commented 5 years ago

@maryjgoldman I have only removed primary_site.project and disease_type.project since we didn't reach a conclusion for the others.

maryjgoldman commented 5 years ago

That makes a lot of sense. @ayan-b can you investigate these fields to see if there is ever a time in which they are a list? If not, then we can leave them since they are in the older GDC data as well.

Fields to investigate: project.disease_type project.name project.primary_site project.project_id
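
One way to run this check, as a minimal sketch: count how often each field holds a list. It assumes the case records are already loaded as dicts keyed by the flattened field names; the helper `count_list_values` and the sample records are hypothetical.

```python
from collections import Counter


def count_list_values(rows, fields):
    """Count, per field, how many rows hold a list value (hypothetical helper)."""
    counts = Counter()
    for row in rows:
        for field in fields:
            if isinstance(row.get(field), list):
                counts[field] += 1
    # Report every requested field, including those never seen as a list.
    return {field: counts[field] for field in fields}


fields_to_check = [
    "project.disease_type",
    "project.name",
    "project.primary_site",
    "project.project_id",
]

# Made-up records for illustration; one of them carries a list on purpose.
records = [
    {"project.name": "Example project", "project.primary_site": "Example site"},
    {"project.name": "Example project", "project.primary_site": ["Site A", "Site B"]},
]

report = count_list_values(records, fields_to_check)
```

A field whose count is zero across all projects is never a list in that data and could be left in place.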

yunhailuo commented 5 years ago

> @yunhailuo I am confused. I am looking at your list of fields above that you asked @ayan-b to take out. Many of them are still in there but when I look at them, they are not lists. Similarly, in the old GDC data they appear to not be lists too. Can you please give an example where they are a list?
>
> Fields to investigate: project.disease_type project.name project.primary_site project.project_id

Sorry, I'm not saying they are lists. I just want to get clarification and be sure about what to keep and what to drop.

maryjgoldman commented 5 years ago

Looks great. There are no fields that are lists in the TCGA data.