statgen / pheweb

A tool to build a website to browse hundreds or thousands of GWAS.
MIT License

6 issues/requests for PheWEB #170

Closed jielab closed 5 months ago

jielab commented 2 years ago

Dear Peter and guys:

  1. I think these days most GWAS don’t have REF and ALT that “must match reference genome” (required by pheweb). Also, I think in the long run, this idea of a “REF” allele in the reference genome should be completely abandoned. How could a single genome serve as THE “reference genome” for billions of people in the world? If we agree on this, there would be NO more “ref” allele, since the “ref” could be any of A, C, G, or T. The genomic community has realized this issue, and new references are generated based on many genomes, not on a single genome. I make this point because this is a big problem for pheweb. For example, I found that pheweb could NOT find the rs-ids for many SNPs in my GWAS. There are almost 700,000,000 records in the rsids-v154-hg19.tsv.gz file, so it should be very rare that a SNP in my GWAS is not included in that list. I checked and found this is due to the allele swap in my GWAS file. Is there a quick fix/workaround so that pheweb treats “chr1 123 A C” the same as “chr1 123 C A”? Can I make a copy of the rsids-v154-hg19.tsv.gz file, swap the REF and ALT alleles, and then append it to the original file, so that both “chr1 123 A C” and “chr1 123 C A” are there? Also, the output file in the “pheno_gz” folder only has a fixed number of fields. Is there a way to include all fields in my original GWAS file?

  2. I think pheweb’s “add-rsids” function is really powerful. It would be a big pity that I could not use this feature because of the REF/ALT allele issue. Is there a standalone version of this “add-rsids” tool somewhere?

  3. After I successfully run pheweb and type the command “pheweb process” again, I get “Output files are all newer than input files, so there's nothing to do”. This is great. So, to run pheweb one more time on the same GWAS file (probably with some new parameters), I have to delete the files in the “generated-by-pheweb” folder, correct? I guess that I only need to delete the “generated-by-pheweb/pheno_gz” folder in order to re-run. I should keep the “generated-by-pheweb/resources” folder, which has 3 files (gene_aliases-v37.sqlite3, genes-v37-hg19.bed, rsids-v154-hg19.tsv.gz), correct? These 3 files also exist in the “cache” directory that I specified. This is a waste of disk space, since the rsids-v154-hg19.tsv.gz file is >6GB. Should I remove the “cache=” option in the config.py file, or does this cache option really help with the running speed? If so, can I delete the cache after I have finished running pheweb?

  4. To run pheweb on a remote server, where I can close the terminal and go home, should I put the “&” sign at the end of the command, as in “pheweb serve --open &”, so that it keeps running?

  5. The pheweb tool is really cool, generating so much information in a nice web interface. However, as we know, there is always more and more information available for genetic variants. Take this page for example: https://pheweb.org/UKB-Neale/phenotypes. Is there a way for me to add new columns to this table? For example, I might want to add a link to PubMed. For the “top variants”, I also want to add a link to the UCSC genome browser. Since I run pheweb on my local machine (with all the code and datasets), not on a server, I hope there is a way for users to customize it a little themselves.

  6. Different GWAS use different column names. For example, “REF”, “A1”, “Allele1”, and “EA” are all used for “effect allele”. So, in the config.py file, can I include them all? If so, how do I write it correctly?

Your help is greatly appreciated.

Best regards, Jie

pjvandehaar commented 2 years ago
  1. Yeah, I agree, ref/alt is a pain. I need a script that automatically flips ref/alt to align with hg38. I have most of the necessary code in pheweb detect-ref. Or at least pheweb needs to notice the reversals and warn the user. I'd rather make ref match hg19/38 instead of needing to check both ref/alt and alt/ref in lots of places. I believe that dbSNP correctly matches hg19/38 with its ref alleles, but let me know if I'm wrong about that.

  2. I don't know. It's probably easy to build using pheweb's code. You could copy the code out of pheweb/load/add_rsids.py, or submit a patch that breaks some parts out into a function you can import in your own code.

3a. I think if you replace the input file (with a new timestamp, because it's new), pheweb will notice and regenerate everything downstream. If not, that's a bug. You shouldn't need to delete anything in generated-by-pheweb/.

3b. Yeah, you could set cache_dir = False in config.py. I think the documentation mentions it? If not, it should.

  4. You should use tmux or byobu. If you already know how to use screen, that's not as good but also works.

  5. The README mentions custom_templates which can do that (https://github.com/statgen/pheweb/blob/master/etc/detailed-webserver-instructions.md#customizing-page-contents). I want to rewrite how these tables work to add a bunch of optional columns like that so that you won't have to use custom templates. Are you interested in helping with that? First I want to replace the old table javascript with Tabulator or maybe DataTables.

  6. field_aliases = {'A1': 'ref', 'NEA': 'ref'}.

I hope this is helpful. Sorry, I'm responding in a bit of a rush.
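The `field_aliases` and `cache_dir` settings mentioned in this reply could sit together in config.py roughly as follows. This is only a sketch: the 'A1' and 'NEA' entries and `cache_dir = False` come from the reply, while the 'Allele1' entry is an assumption added for illustration. Whether a given column really holds the reference allele depends on the GWAS file, so each alias should be checked against the file's documentation.

```python
# config.py (sketch, not a complete PheWeb configuration)

# Map column names found in the GWAS input files to the field PheWeb expects.
# Several input names may point at the same field.
field_aliases = {
    'A1': 'ref',       # from the reply above
    'NEA': 'ref',      # from the reply above
    'Allele1': 'ref',  # assumption: illustrative only, verify for your files
}

# Do not keep a second copy of the resource files in a cache directory.
cache_dir = False
```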

jielab commented 2 years ago

Dear Peter:

Thank you very much!

  1. I have a Python-savvy graduate student who can help to update your add_rsids.py script so that it allows: (1) swapped ref/alt alleles; (2) the output files to keep extra columns from the original file. We can make it work for hg19, hg38, or any other version, since the merging function should not depend on the genome build. The programmer recently helped me to build github.com/jielab/pageant, which is under revision and very likely to be published soon. As an alumnus of Michigan, I want to contribute a little bit to the amazing PheWeb tool. I think this add_rsids.py function is also used by locuszoom.

3b. Will setting "cache=None" be the same as not setting it at all?

  4. I don't know any of these: tmux, byobu, screen. I hope dreamhost would allow me to use those.

  6. It is great that I could use this approach to set "ref" multiple times.

=================

  1. I just ran into a new error message: TypeError: unsupported operand type(s) for /: 'float' and 'str'. It seems pheweb does not like numeric values written in a format such as 2.11373e+00. If that is the reason, I hope this can be addressed.


  2. Also, for error messages such as "But in your file, the chromosome '1' came after the chromosome '19'", it might be better to output that as a warning, sort the data into the correct CHR and POS order, and still process it. Otherwise, this issue comes up quite often with public GWAS.

Best regards, Jie

jielab commented 2 years ago

Dear Peter:

Another quick follow-up:

  1. Most of my GWAS files already have an "rsid" field (usually "SNP"). But it seems that pheweb will ignore my SNP field and run add_rsids anyway, correct? I think it would be good to have an option to keep/use the original one.

  2. For the customization that I previously asked about, I didn't mean adding some text to the "About" tab or something like that, but adding a URL to each of the loci displayed by pheweb, such as a link from each phenotype to its publication, or a link from each genomic locus to its UCSC browser or dbSNP track.

  3. For "cache=" in the config.py file, I just found out that pheweb will automatically create a cache in root/.XXX if I do not set it. So, it seems that I had better set "cache=None". I just do not understand why I need to set a cache for this, since all the files put in the cache are already in the generated-by-pheweb/resources folder. Why do we need two copies?

BTW, I will ask my graduate student programmer to take a look at the add_rsids.py script, and hopefully we can contribute a new version that you like.

Best regards, Jie

pjvandehaar commented 2 years ago

I have a PYTHON savvy graduate student, who can help to update your add_rsids.py script so that it allows: (1) swapped ref/alt allele; (2) the output files to keep extra columns in the original file

Thanks! If you make the change, I'd appreciate if you send your changes as a pull request. I'll make it optional for pheweb.
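As a rough illustration of the swapped ref/alt matching being discussed, the sketch below looks a variant up in both orientations and reports whether a swap was needed. It is not PheWeb's add_rsids.py: the function name `find_rsid`, the in-memory dict, and the rsid `rs0000001` are all invented for this example, and the real rsids file is far too large to load into a dict like this.

```python
from typing import Dict, Optional, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def find_rsid(variant: Variant, rsids: Dict[Variant, str]) -> Optional[Tuple[str, bool]]:
    """Return (rsid, was_swapped), or None if neither orientation is known."""
    chrom, pos, ref, alt = variant
    if variant in rsids:
        return rsids[variant], False
    swapped = (chrom, pos, alt, ref)
    if swapped in rsids:
        return rsids[swapped], True
    return None


# Toy lookup table with a made-up rsid.
rsids = {('1', 123, 'A', 'C'): 'rs0000001'}

print(find_rsid(('1', 123, 'A', 'C'), rsids))  # direct match
print(find_rsid(('1', 123, 'C', 'A'), rsids))  # match after swapping ref/alt
print(find_rsid(('1', 123, 'A', 'G'), rsids))  # no match in either orientation
```

Reporting `was_swapped` matters because a swapped variant also needs its effect size sign flipped and its allele frequency replaced by one minus the frequency before it is comparable with unswapped variants.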

I just run into a new error message:

I can help you with that error if you post the full error text.
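For context on that TypeError, plain Python parses scientific notation without trouble, and the message appears whenever a float is divided by a value that is still a string. The sketch below only demonstrates those two facts. Where the unconverted string arises inside pheweb is unknown without the full traceback, so this is not a diagnosis.

```python
raw = "2.11373e+00"

# Scientific notation is a valid float literal for float().
value = float(raw)
print(value / 2)

# Dividing a float by the unparsed string reproduces the reported message.
try:
    1.0 / raw
    message = ""
except TypeError as err:
    message = str(err)
print(message)
```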

Also, for error messages such as "But in your file, the chromosome '1' came after the chromosome '19'", it might be better to output that as a warning, but sort the data into the correct CHR and POS order and still process it.

Yeah, I agree. I'd appreciate a pull request for this. If you work on this let me know and I’ll give some guidance.
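A minimal sketch of the sorting step suggested here, assuming rows are dicts with hypothetical `chrom` and `pos` keys and that the desired chromosome order is 1 to 22, then X, Y, MT. PheWeb's own ordering and row format may differ, and a full GWAS file may be too large to sort in memory this way.

```python
# Assumed chromosome order: 1..22, then X, Y, MT.
CHROM_ORDER = {c: i for i, c in enumerate([str(n) for n in range(1, 23)] + ['X', 'Y', 'MT'])}


def sort_key(row):
    """Order rows by chromosome rank, then by position."""
    chrom = row['chrom']
    if chrom.startswith('chr'):
        chrom = chrom[3:]  # accept both '1' and 'chr1'
    # An unknown chromosome name raises KeyError rather than being guessed at.
    return CHROM_ORDER[chrom], row['pos']


rows = [
    {'chrom': '19', 'pos': 500},
    {'chrom': '1', 'pos': 200},
    {'chrom': 'chr1', 'pos': 100},
    {'chrom': 'X', 'pos': 5},
]
sorted_rows = sorted(rows, key=sort_key)
for row in sorted_rows:
    print(row['chrom'], row['pos'])
```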

Most of my GWAS files already have "rsid" field (usually "SNP"). But it seems that pheweb will ignore my SNP field and run add_rsids anyway, correct? I think it would be good to have an option to keep/use the original one.

I understand but I don't think I'd want this change in pheweb. If you want to make this change for yourself, add a new per_variant_field called "rsid" and then skip the step add-rsids.

For the customization that I previously asked, I didn't mean to add some text into the "About" tab or something like that, but to add some URL to each of the loci displayed by pheweb, such as adding a link for each phenotype to its publication, adding a link for all genomic loci for their UCSC browser or dbSNP track.

I believe this can be done with custom_templates/.
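The links themselves follow simple URL patterns, which the sketch below builds in Python purely to show their shape. The helper names are invented for this example, and how such links would be wired into custom_templates/ is not shown, since that depends on the template being overridden.

```python
from urllib.parse import urlencode


def ucsc_url(chrom, start, end, build='hg19'):
    """UCSC Genome Browser link for a region such as chr1:100-200."""
    query = urlencode({'db': build, 'position': f'chr{chrom}:{start}-{end}'})
    return f'https://genome.ucsc.edu/cgi-bin/hgTracks?{query}'


def dbsnp_url(rsid):
    """dbSNP page for an rsid such as 'rs123'."""
    return f'https://www.ncbi.nlm.nih.gov/snp/{rsid}'


def pubmed_url(pmid):
    """PubMed page for a numeric PubMed ID."""
    return f'https://pubmed.ncbi.nlm.nih.gov/{pmid}/'


print(ucsc_url('1', 100, 200))
print(dbsnp_url('rs123'))
print(pubmed_url(12345))
```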

For "cache=" in the config.py file, I just found out that pheweb will automatically create a cache in root/.XXX file if I did not set it. So, it seems that I had better "set cache=None".

Thanks, I just changed this in the latest version on github. PheWeb now doesn't cache by default. You'll have to use git clone and pip3 install -e . to get this version. For the stable version gotten from pip3 install pheweb you need cache_dir=False.

jielab commented 2 years ago

Dear Peter:

Thanks for providing the new version.

I previously did NOT use "git clone". Instead, I simply ran "pip3 install pheweb" and then the software magically worked. I could not even find where the source .py scripts are located after running "pip3 install pheweb". So, do I still need to run "git clone" after I run "pip3 install pheweb"? What will I get from "git clone", since I can already run pheweb?

I have now run pip3 uninstall pheweb and then pip3 install pheweb. I hope that I got the latest version installed this way.

Best regards, Jie