hetio / het.io

Source code for https://het.io website

header info in input files #1

Closed jcbarret closed 4 years ago

jcbarret commented 6 years ago

I'm looking at files at http://het.io/disease-genes/downloads/ and am wondering if there's a key to the headers of the different input files? For example, https://raw.githubusercontent.com/dhimmel/het.io-dag-data/d8028c8820322ae4ad7642998bccc3ee7318ff16/downloads/diseases.txt has columns HC-P, HC-S, LC-P, LC-S but I'm not sure what they are. Sorry if this is obvious somewhere, but I couldn't find it after some searching.

dhimmel commented 6 years ago

The S6 Data caption from the associated PLOS Computational Biology paper is slightly more helpful:

An extended version of Table 3 including all diseases with at least one GWAS-Catalog-extracted association. The manual pathophysiology classification is included.

The caption for Table 3 is:

Diseases. Associations were predicted for 29 diseases with at least 10 positives. For these diseases, the number of high-confidence primary (HC-P), high-confidence secondary (HC-S), low-confidence primary (LC-P), and low-confidence secondary associations (LC-S) that were extracted from the GWAS Catalog is indicated.

Hopefully that answers your question regarding diseases.txt. See the Associations Method section for more about how disease-gene associations were extracted from the GWAS Catalog and what HC-P, HC-S, LC-P, and LC-S mean.
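For anyone who wants to work with these columns programmatically, here is a minimal sketch that applies the key from the Table 3 caption to a delimited table. It assumes the file is tab-delimited and that the four count columns hold integers; neither is confirmed in this thread. The `disease` column name, the `summarize_counts` helper, and the sample rows are hypothetical and only illustrate the idea.

```python
import csv
import io

# Column key taken from the Table 3 caption quoted above.
COLUMN_KEY = {
    "HC-P": "high-confidence primary associations",
    "HC-S": "high-confidence secondary associations",
    "LC-P": "low-confidence primary associations",
    "LC-S": "low-confidence secondary associations",
}


def summarize_counts(text, delimiter="\t"):
    """Sum each documented count column across all rows of a delimited table.

    Assumes (not confirmed by the source) that the table has a header row
    containing the four abbreviations and that their cells are integers.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    totals = {column: 0 for column in COLUMN_KEY}
    for row in reader:
        for column in COLUMN_KEY:
            totals[column] += int(row[column])
    return totals


# Made-up rows for illustration only; these are not values from diseases.txt.
sample = (
    "disease\tHC-P\tHC-S\tLC-P\tLC-S\n"
    "example disease A\t12\t3\t5\t1\n"
    "example disease B\t10\t0\t2\t4\n"
)

for column, total in summarize_counts(sample).items():
    print(f"{column} ({COLUMN_KEY[column]}): {total}")
```

To try this on the real file, pass its contents as `text`; if the header or delimiter differs from what is assumed here, adjust `COLUMN_KEY` or `delimiter` accordingly.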

Note that the files available at http://het.io/disease-genes/downloads/ are from our 2015 study to predict disease-associated genes. Most users will be more interested in Hetionet v1.0, which is available at https://neo4j.het.io (currently down; I will fix it) and at https://github.com/dhimmel/hetionet. This hetnet is described in our 2017 eLife study called Project Rephetio. That project has much more detailed supplementary methods, since we discussed all code and data on Thinklab while performing the project. For example, see this discussion for how we processed the GWAS Catalog to get gene-disease associations in Project Rephetio. The method was very similar to the one used in the predecessor study that created the diseases.txt file mentioned above.

dhimmel commented 6 years ago

More generally, @jcbarret correctly points out that the table columns are not well documented for the files at http://het.io/disease-genes/downloads/. I don't have any immediate plans to fix this, but I encourage users to post GitHub issues with any questions. At some point in the future, I'd like to revamp the het.io website and may address some of these issues then.

dhimmel commented 4 years ago

We're moving the downloads page for the disease-genes study to GitHub from https://het.io/disease-genes/downloads/.

The README (pinned version) now shows the first two rows of each table for convenience. While the columns are still not fully documented, I will close this for now. Happy to elaborate on column meanings as requested. As I noted above, most users will probably be interested in the newer Hetionet data instead.