maryjgoldman closed 5 years ago
Maybe more cleanup is needed on cases.project. We have the following fields:
Fields | Possible values |
---|---|
project.dbgap_accession_number | Always null for TCGA; but can be [phs000467, phs000465, phs000471, phs000468, phs000470, phs000466] for TARGET; useful? |
project.disease_type | A list |
project.name | Uterine Corpus Endometrial Carcinoma... |
project.primary_site | A list |
project.project_id | TCGA-UCEC... |
project.released | Always true? |
project.state | Always open? |
Only removing disease_type.project for the time being: https://github.com/yunhailuo/xena-GDC-ETL/pull/68.
Good call @ayan-b to stick with the known until Yunhai and I come to a conclusion.
As for these other fields, it sounds like we need to remove them. Should we open a new GitHub ticket, or do it here? I'm fine with either. Whatever you prefer @ayan-b
Let's do it here.
@maryjgoldman Updated data in the hub.
@yunhailuo I am confused. I am looking at your list of fields above that you asked @ayan-b to take out. Many of them are still in there, but when I look at them, they are not lists. Similarly, in the old GDC data they do not appear to be lists either. Can you please give an example where they are a list?
Fields to investigate: project.disease_type project.name project.primary_site project.project_id
@maryjgoldman I have only removed primary_site.project and disease_type.project since we didn't reach a conclusion for the others.
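For reference, dropping just those two columns while leaving the rest untouched could look roughly like the sketch below. It assumes each row of the phenotype table is a plain dict keyed by column name; `drop_fields` and the sample row are hypothetical and not taken from the actual ETL code.

```python
# Columns removed so far, as named in this thread.
REMOVED = ["primary_site.project", "disease_type.project"]


def drop_fields(rows, to_drop):
    """Return copies of the rows without the given columns.

    Columns that are absent from a row are simply ignored.
    """
    to_drop = set(to_drop)
    return [
        {key: value for key, value in row.items() if key not in to_drop}
        for row in rows
    ]


# Made-up row for illustration only.
rows = [
    {
        "project_id.project": "TCGA-UCEC",
        "primary_site.project": ["Uterus"],
        "disease_type.project": ["Example disease type"],
    }
]

cleaned = drop_fields(rows, REMOVED)
```

The other project fields stay in place until we decide what to do with them.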
That makes a lot of sense. @ayan-b can you investigate these fields to see if there is ever a time in which they are a list? If not, then we can leave them since they are in the older GDC data as well.
Fields to investigate: project.disease_type project.name project.primary_site project.project_id
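One possible way to run that check, as a minimal sketch: scan every record and report any of the four fields whose value is a list. It assumes the records are already flattened into dicts keyed by the dotted field names; `find_list_fields` and the sample records are hypothetical, and the sample values are made up.

```python
from collections import defaultdict

FIELDS_TO_CHECK = [
    "project.disease_type",
    "project.name",
    "project.primary_site",
    "project.project_id",
]


def find_list_fields(records, fields):
    """Map each field that holds a list in any record to the list values seen."""
    hits = defaultdict(list)
    for record in records:
        for field in fields:
            value = record.get(field)
            if isinstance(value, (list, tuple)):
                hits[field].append(list(value))
    return dict(hits)


# Made-up records for illustration only; the second one has a list on purpose.
records = [
    {"project.project_id": "TCGA-UCEC", "project.primary_site": "Uterus"},
    {"project.project_id": "EXAMPLE-1", "project.primary_site": ["Kidney", "Blood"]},
]

list_fields = find_list_fields(records, FIELDS_TO_CHECK)
```

An empty result would mean none of the fields is ever a list in the data that was scanned.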
Sorry, I'm not saying they are lists. I just want to get clarification and be sure about what to keep and what not to keep.
Looks great. No fields are lists in the TCGA data.
I think we said to do this at some point but I'm still seeing this field in the files. It is a list and we don't want that. And the data in this field is already contained in the 'disease_type' field that we get from the xml files.