Various feature requests

statgen / locuszoom-hosted

A web service to upload and share GWAS results with LocusZoom.js

https://my.locuszoom.org

MIT License


Various feature requests #19

Open jielab opened 3 years ago

jielab commented 3 years ago

Hi, there:

I am trying to run "tabix my.GWAS.gz". The my.GWAS.gz file is tab-delimited and has columns such as CHR POS SNP REF ALT BETA SE P N. However, I got the error message "[E::get_intv] failed to parse TBX_GENERIC, was wrong -p [type] used?".

Also, I am requesting two features:

- Display more columns (or all columns) of the original file in the Top Loci table. Right now, it only lists rsID, CHR:POS, -logP. It would be good to display fields from my original GWAS file such as SNP, P, REF, ALT, BETA. Andy suggested using the SHA256 hash to implement this. That sounds great.
- Display multiple GWAS Manhattan plots. For example, I ran 3 BMI GWAS on the same data: 1. males, 2. females, 3. both sexes. I would love to show these 3 Manhattan plots horizontally, and also the same top loci of the 3 GWAS on the same page (such as the FTO locus).

Thank you very much for your consideration.

Best regards, Jie

abought commented 3 years ago

A few notes from the emails- though it won't be worked on immediately, I'd like to jot down notes while context remains in my memory.

Requests broken down by type:

- The first error / thread title is an error with the tabix utility, which is not part of locuszoom-*. See the tabix user manual for details. It sounds like you need to tell it where to find the chromosome (--sequence) and position (--begin and --end, usually the same column in a GWAS file). Other options may be needed depending on your data, but we don't provide user support for third-party tools.
- To verify that the uploaded file is the specific one expected: the request is to display the SHA256 of the uploaded file on the summary page.
- We'd prefer not to show raw p-values (instead of -log10 p) due to numerical underflow issues. However, showing more columns would be nice. We should start by enforcing display of ref and alt alleles, and eventually extend to allowing the parser to support other user-provided data.
- I'd love to provide a way to compare studies, especially summary stats in the region plot view. This will require rethinking our search and metadata features, to help users discover relevant datasets that can be added.
  - LocalZoom provides a comparison feature, because all the datasets are on the user's hard drive. In a website with a mix of personal and public data, finding really good comparisons takes a bit more work.
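
The underflow concern can be illustrated with a short Python sketch. This is not code from my.locuszoom.org; the helper name `neg_log10_p` is hypothetical. It converts a p-value written as text into -log10(p) from the mantissa and exponent directly, so a value too small for a 64-bit float is never rounded to zero.

```python
import math


def neg_log10_p(text: str) -> float:
    """Convert a p-value string such as '3.2e-450' to -log10(p).

    Hypothetical helper: the mantissa and exponent are handled separately,
    so very small p-values never pass through a 64-bit float.
    """
    mantissa, _, exponent = text.strip().lower().partition("e")
    m = float(mantissa)
    if m <= 0:
        raise ValueError(f"p-value must be positive: {text!r}")
    e = int(exponent) if exponent else 0
    return -(math.log10(m) + e)


# A 64-bit float cannot hold 3.2e-450, so the raw p-value collapses to zero.
print(float("3.2e-450") == 0.0)
# The -log10 form keeps the information.
print(round(neg_log10_p("3.2e-450"), 3))
```

Ordinary values such as "0.05" go through the same function, so one code path can serve the whole column.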

jielab commented 3 years ago

Dear Andy:

Thank you very much!

I have now made tabix work by using tabix -f -s 1 -b 2 -e 2 MY.gwas.gz. Previously, I used both -p bed and -b 2 -e 2, which are not compatible, because the bed format means the end position is the 3rd column and that could not be changed by -e 2.

One issue here is that I do need to add a "#" in front of the first line in order for tabix to run. So, the first row of the first column is "#CHR" instead of "CHR". This creates a problem for other software. As you know, many programs will not accept a variable whose name starts with a "#". I don't know if there is a good workaround for this.
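
One possible workaround, sketched here in Python under stated assumptions: when downstream code reads the compressed file itself, it can strip a leading "#" from the header while parsing, so the same file serves both tabix and other tools. The function name `read_gwas_header` and the demo file are hypothetical. bgzip output is gzip-compatible, so the standard `gzip` module can read it.

```python
import gzip
import os
import tempfile
from typing import List


def read_gwas_header(path: str) -> List[str]:
    """Return column names from the first line, dropping any leading '#'."""
    with gzip.open(path, "rt") as handle:
        first_line = handle.readline().rstrip("\n")
    return first_line.lstrip("#").split("\t")


# Demo on a throwaway file whose header starts with '#'.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.gwas.gz")
    with gzip.open(demo, "wt") as out:
        out.write("#CHR\tPOS\tSNP\tREF\n1\t100\trs1\tG\n")
    print(read_gwas_header(demo))
```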

Thank you & best regards, jie

abought commented 3 years ago

No worries. GWAS file formats are a little weird- one reason that we don't officially offer email support to help people use tabix is that each research group, analysis program, or set of command line flags tends to introduce its own file format. There is no one single set of instructions we can write that makes tabix work on all these variations. I got up to about 30 plausible file formats and stopped counting.

I haven't tested this in tabix because it's 10pm and a cat is sitting on me, but I encourage you to read their documentation closely and particularly look at the skip lines option. By default tabix can automatically handle header rows that start with "#", but otherwise you need to tell it to skip a certain number of header rows. Also for gwas files, we specify the same position column to be both the begin and end flags in tabix. (Because it is a point, not an interval)
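
The advice above can be sketched as a small Python helper that assembles a tabix command for a point-based GWAS file. The helper name `build_tabix_command` and its parameters are hypothetical. The flags are tabix's documented short options: `-s` for the sequence column, `-b` and `-e` for the begin and end columns, `-S` for the number of lines to skip, and `-f` to overwrite an existing index. Column numbers are 1-based, and the input is assumed to be bgzip-compressed and sorted by chromosome and position.

```python
import os
import shutil
import subprocess
from typing import List


def build_tabix_command(path: str, chrom_col: int, pos_col: int,
                        header_lines: int = 0) -> List[str]:
    """Assemble a tabix call for a file with a single position column."""
    cmd = ["tabix", "-f", "-s", str(chrom_col),
           # A variant is a point, so begin and end use the same column.
           "-b", str(pos_col), "-e", str(pos_col)]
    if header_lines:
        # Only needed when the header line does not start with '#'.
        cmd += ["-S", str(header_lines)]
    return cmd + [path]


cmd = build_tabix_command("MY.gwas.gz", chrom_col=1, pos_col=2, header_lines=1)
print(" ".join(cmd))

# Run the command only when tabix is installed and the file is present.
if shutil.which("tabix") and os.path.exists("MY.gwas.gz"):
    subprocess.run(cmd, check=True)
```

Recording the column numbers as named arguments, rather than a bare command line, makes it easier to adapt the call when a pipeline changes its column order.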

Since flexible tools require so much customization, we as a community tend to rely on the people around us to provide a first layer of support. If your research group has a way of sharing info (like a wiki), I encourage you to write down the final working tabix command in a way that makes the knowledge available to others. (Often groups have tools they prefer for their research area, or they use a shared analysis pipeline and a common file format when working together. If you can add the command to the automated pipeline, so much the better!)

On Jan 14, 2021, at 10:16 PM, Jie Huang notifications@github.com wrote:

Dear Andy:

Thank you very much! I understand that tabix is a 3rd-party tool, and I should not ask tabix question here.

But LocalZoom only works with tabix-indexed files, and I could not generate one even after I read all the tabix documentation.

I created a very simple file abc.txt, with just the following two lines (tab delimited):

chrX 2700157 2700157 rs5939319 G
chrX 2703633 2703633 rs1419931 A

After bgzip abc.txt, the command tabix -f -p bed abc.txt.gz works fine.

However, as you know, a regular GWAS file comes with a header line and the POS column is not duplicated. After I change the above 2 lines to the following 3 lines:

CHR POS SNP REF
chrX 2700157 rs5939319 G
chrX 2703633 rs1419931 A

The tabix command does not work any more, even when I use options such as "--sequence 1 --begin 2 --end 2 -S 1".

So, can you please kindly test my above abc.txt file with just 3 lines, and make tabix work for it?

I am sure that this piece of instruction would be very helpful for many users who need to create tabix indexed GWAS file in order to use LocalZoom.

Thank you & best regards, jie


jielab commented 3 years ago

Dear Andy:

Yes, I do know of the -S option to skip the first N lines. I thought that if I skipped the header line, locuszoom would not be able to read it. It turns out that I was wrong. I can now use tabix -f -S 1 -s 1 -b 2 -e 2 MY.gwas.gz successfully without adding a # to the header row, and locuszoom has no problem reading the header row. This is really GREAT!

Now I am able to use LocalZoom to view my local MY.gwas.gz files. Please see the screenshot below.

I do have a few minor suggestions/feedback:

- I get an error "could not parse specified range" if I specify a range that is too big, such as 1:1000000-10000000. I thought that LocalZoom could present a Manhattan plot first, just like the upload version at my.locuszoom.org. It would be nice to have a Manhattan plot and a "Top Loci" table.
- When I click the "LD Population EUR" button, the popup window does not display after I make a selection.
- Please see the screenshot below. My input GWAS file actually has 4 columns regarding alleles: REF ALT A1 A2, where REF/ALT is based on the reference human genome, while A1 is the effect allele. In most cases, A1 is the same as ALT, but not always. On the LocusZoom data upload page, the first window "variant from columns" uses the terms "Ref allele" and "Alt allele", but the second window (shown below) uses the term "effect allele". And I could only choose "Ref" or "Alt", but not A1 or A2. So, this is a bit confusing. Should I simply ignore the REF and ALT columns in my GWAS file, and only use A1 and A2 in this case?

Thank you & best regards, Jie

abought commented 3 years ago

As noted in the LocalZoom instructions,

"This service is designed to efficiently fetch only the data needed for the plot region of interest. Therefore, it cannot generate summary views that would require processing the entire file (eg Manhattan plots). "

Rather than maintain two different software codebases, advanced "summarize this file" features (like Manhattan plots and top loci) are explicitly provided only in my.locuszoom.org. LocalZoom is a viewer tool but is not meant as a replacement for an analysis pipeline.

Likewise, the max region size of LocalZoom currently caps out at ~1 Mb. We may increase this to 2 Mb in the future, but the no-upload, client-side LocalZoom is not intended to be a full multiscale genome viewer.
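
As an illustration of the size limit, here is a hedged Python sketch of a region check. It is not LocalZoom's code: the function name `parse_region`, the accepted `chr:start-end` syntax, and the exact cap of 1,000,000 base pairs are assumptions based on the approximate figure above.

```python
import re
from typing import Tuple

# Assumed cap, based on the approximate limit quoted in this thread.
MAX_REGION_BP = 1_000_000

# Accepts "1:100-200" or "chrX:100-200"; this syntax is an assumption.
_REGION = re.compile(r"^(?:chr)?([0-9A-Za-z]+):(\d+)-(\d+)$")


def parse_region(text: str, max_bp: int = MAX_REGION_BP) -> Tuple[str, int, int]:
    """Parse 'chrom:start-end' and reject regions wider than max_bp."""
    match = _REGION.match(text.strip().replace(",", ""))
    if match is None:
        raise ValueError(f"could not parse region: {text!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if end < start:
        raise ValueError("region end is before region start")
    if end - start > max_bp:
        raise ValueError(f"region spans {end - start:,} bp; limit is {max_bp:,} bp")
    return chrom, start, end


print(parse_region("1:1000000-1500000"))
```

Under these assumptions, the range 1:1000000-10000000 from the earlier comment spans 9,000,000 bp and would be rejected with a message that states the limit.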

abought commented 3 years ago

Note to self: we do need to clarify the terminology on the "allele frequency" section; thanks for catching that!

Essentially, conventions for specifying allele frequency vary widely. Some files give the allele frequency for a specific allele of interest (eg effect allele, "major/minor", etc), which may not be the same as the variant specified in the "alt" column. (Another common convention is to specify counts instead of frequencies)

Instead of assuming that AF = "alt" frequency, we allow people to use any of ~3 different conventions, and tell the parser how to align their data with a consistent harmonized reference. Our hope is to provide advanced tools for comparing your results to other public studies in the future, but doing that requires some rather fiddly and sometimes confusing UI to ensure that all uploaded files end up harmonized so that a given column means the same thing across files.
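
To make the harmonization idea concrete, here is a minimal Python sketch. It is not the parser used by my.locuszoom.org. It assumes each row carries REF and ALT alleles plus a frequency reported for some named allele (for example an effect allele column such as A1), and converts that value to an ALT allele frequency. The function name and arguments are hypothetical.

```python
from typing import Optional


def alt_allele_frequency(ref: str, alt: str, freq_allele: str,
                         freq: float) -> Optional[float]:
    """Express a frequency reported for `freq_allele` as an ALT frequency.

    Assumes a biallelic site, so the REF and ALT frequencies sum to 1.
    Returns None when the reported allele matches neither REF nor ALT.
    """
    if not 0.0 <= freq <= 1.0:
        raise ValueError("frequency must be between 0 and 1")
    allele = freq_allele.upper()
    if allele == alt.upper():
        return freq
    if allele == ref.upper():
        return 1.0 - freq
    return None


# Frequency already refers to ALT, so it is returned unchanged.
print(alt_allele_frequency("G", "A", "A", 0.25))
# Frequency refers to REF, so it is flipped.
print(alt_allele_frequency("G", "A", "G", 0.25))
```

Returning None for an unmatched allele, rather than guessing, keeps rows with inconsistent allele columns visible instead of silently mislabeling them.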

jielab commented 3 years ago

Dear Andy:

Thank you very much for clarification!

Maybe my.locuszoom.org could be designed similarly to PheWeb, so that users could set it up on their own local server.

For example, a group put all thousands of UK Biobank GWAS results at fastgwa.info. As shown on this link http://fastgwa.info/ukbimp/pheno/20015, there is also a Manhattan Plot followed by a "Top Loci" table. Users can also click on each locus of the Top Loci table. But of course users will get a phewas plot instead of a Locuszoom plot, since "The online tool was developed based on the source code modified from PheWeb" (http://fastgwa.info/ukbimp/about).

Since PheWeb was also developed at UMICH, you guys might know each other very well. It would be very nice to see these two tools working together. For the "Top Loci" table that I mentioned above, it would be really nice to have a link for a phewas plot, and another link for a locuszoom plot.

What do you think?

Best regards, Jie

abought commented 3 years ago

We are indeed familiar with PheWeb- in fact, the code to prepare the manhattan and QQ plots is shared between the two projects.

However, when we built my.locuszoom.org, we consciously chose not to try to duplicate the core purpose or focus between the two services. PheWeb is aimed at presenting many different GWAS studies together in one place, whereas my.locuszoom.org is focused on letting users explore individual studies. By encouraging "bulk import" users to try PheWeb instead, we are able to provide a free and easy to use upload-your-own service to a large community: some PheWebs may involve terabytes of starting data and days of server-side processing, and I'm not sure that our research group could afford to host every pheweb for all genetics researchers in the world!

If we get enough high-quality public datasets with good metadata, I could see letting users request a phewas out of existing studies in the future. We aren't currently there yet, so we try to provide the same high quality annotations per study, but not generate a phewas from everything on the site.

If you really want to host your own my.locuszoom.org instance, code and notes are (mostly) in the repository where we are having this discussion, and we always welcome contributions to help make deployment more streamlined. But I would absolutely start by defining the goals, as you might be able to get the customizations you want by creating a more focused tool with just the plotting code (LocusZoom.js) by itself.

jielab commented 3 years ago

Dear Andy:

Thank you very much again! I will not try to customize locusZoom, because you guys are the experts and I just want to be a good user :- ). I will try to come up with bug reporting and wise suggestions, while not wasting too much of your precious time to read my messages :- )

One minor feature, if I could request it: can the axis labels and the gene names on the Manhattan Plot be shown in bold font and at a slightly larger size, just like in the LocusZoom Plot? Also, it would be great to have a "save as PNG" option for the Manhattan Plot, just like the LocusZoom plot.

Your help is greatly appreciated.

Best regards, Jie

abought commented 3 years ago

This ticket has a lot of different things to unpack, and I'll try to distill into a more focused checklist in the near future.

Per initial discussions from the email list, several quality-of-life improvements have been shipped in the newest release:

- The "top loci" table will automatically show ref and alt alleles when such information is available.
- Sorting the top loci table by marker will now correctly sort by chromosome and position, instead of lexicographically.
- To help users verify that they uploaded the correct file, a new "checksum" button has been added to the manhattan plot page (visible only to the person who uploaded the study). You can use this to view the SHA256 for what was originally uploaded. This is a hash value that distills everything in the file into a single short string that can be calculated locally.
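
A minimal Python sketch of calculating that value locally, assuming the site reports a standard SHA-256 digest of the file exactly as uploaded. The thread does not say how the value is displayed; hexadecimal is assumed here. The function name `sha256_of_file` is hypothetical. Reading in chunks avoids loading a large GWAS file into memory.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops at the empty bytes returned at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `sha256_of_file("MY.gwas.gz")` on the local copy and comparing the result with the value shown by the site confirms whether the two files are byte-for-byte identical. The `sha256sum` (Linux) and `shasum -a 256` (macOS) commands produce the same hexadecimal digest.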

jielab commented 3 years ago

thank you very much, Andy!

Best regards, jie

abought commented 3 years ago

I've gone through this ticket and tried to triage various actionable suggestions (which may not all get fixed or all at once). Checklist of major remaining items:

[ ] Easier to read font size/face for manhattan plot axis labels
[ ] Optionally show more columns for top loci table on manhattan plot / summary page
[x] New "compare studies" view (requires many internal enhancements)
[ ] Clarify wording around the "effect allele" checkbox in the allele frequency "column picker" UI