nmdp-bioinformatics / ImmunogeneticDataTools

Immunogenetic Data Tools related to HLA, GLStrings, Linkage Disequilibrium

Summary #73

Closed kosoegawa closed 7 years ago

kosoegawa commented 7 years ago

It would be great if we could have a summary that contains both haplotype pairs and haplotype frequencies in a single file.

First [HLA-A, HLA-C, HLA-B, HLA-DRB1, HLA-DQB1] Haplotype pair:

HLA-A*02:01g~HLA-C*03:03g~HLA-B*55:01~HLA-DRB1*13:01~HLA-DQB1*06:03g [Hap1]

HLA-A*33:03g~HLA-C*03:02~HLA-B*58:01g~HLA-DRB1*13:02~HLA-DQB1*06:09 [Hap2]

Broad Race: AFA Freq: + 9.072092601647355E-9, Relative Freq (%): + 89.35 Freq: + 1.439175106E-5, Rank: + 8771.0 [Hap1] Freq: + 6.303675323333E-4, Rank: + 177.0 [Hap2]

Broad Race: API Freq: + 2.8777463660194647E-8, Relative Freq (%): + 100.00 Freq: + 3.0529489158E-6, Rank: + 17396.0 [Hap1] Freq: + 0.0094261202705561, Rank: + 7.0 [Hap2]

Broad Race: CAU Freq: + 4.161517469724582E-8, Relative Freq (%): + 99.62 Freq: + 9.21712619603E-5, Rank: + 1439.0 [Hap1] Freq: + 4.514983717503E-4, Rank: + 305.0 [Hap2]

Broad Race: HIS Freq: + 1.8302949101784554E-9, Relative Freq (%): + 72.11 Freq: + 7.8604863158E-6, Rank: + 10523.0 [Hap1] Freq: + 2.328475410611E-4, Rank: + 681.0 [Hap2]

Broad Race: NAM Freq: + 2.951370262344005E-8, Relative Freq (%): + 100.00 Freq: + 4.95517929117E-5, Rank: + 2618.0 [Hap1] Freq: + 5.956132137546E-4, Rank: + 261.0 [Hap2]

Detailed Race: AAFA Freq: + 5.083507167091338E-9, Relative Freq (%): + 100.00 Freq: + 7.7626351634E-6, Rank: + 12500.0 [Hap1] Freq: + 6.548687475433E-4, Rank: + 172.0 [Hap2]

Detailed Race: AINDI Freq: + 6.321684375847337E-8, Relative Freq (%): + 100.00 Freq: + 9.8541916977E-6, Rank: + 7402.0 [Hap1] Freq: + 0.0064152236629645, Rank: + 10.0 [Hap2]

Detailed Race: EURCAU Freq: + 3.2814279790121384E-8, Relative Freq (%): + 99.79 Freq: + 9.79772120929E-5, Rank: + 1348.0 [Hap1] Freq: + 3.349174679415E-4, Rank: + 412.0 [Hap2]

Detailed Race: MSWHIS Freq: + 3.2003580566482058E-9, Relative Freq (%): + 96.86 Freq: + 1.69265968926E-5, Rank: + 5602.0 [Hap1] Freq: + 1.890727401943E-4, Rank: + 865.0 [Hap2]
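The sample values above are consistent with each pair frequency being the product of its two haplotype frequencies (for AFA, 1.439175106E-5 × 6.303675323333E-4 ≈ 9.0721E-9). A minimal Python sketch of that arithmetic follows. The function names are hypothetical, and treating the relative frequency as the pair's share of all candidate pairs in the same population is an assumption inferred from the label, not something confirmed by the tool's source.

```python
def pair_frequency(hap1_freq, hap2_freq):
    """Product of the two haplotype frequencies (assumed from the sample values)."""
    return hap1_freq * hap2_freq


def relative_frequency_percent(pair_freq, candidate_pair_freqs):
    """Share of one pair among all candidate pairs for a population, in percent.

    Assumption: 'Relative Freq (%)' is normalized over every candidate
    haplotype pair found for the same sample and population.
    """
    total = sum(candidate_pair_freqs)
    if total == 0:
        raise ValueError("no candidate pair has a non-zero frequency")
    return 100.0 * pair_freq / total


# AFA haplotype frequencies copied from the sample output above.
afa_pair = pair_frequency(1.439175106e-5, 6.303675323333e-4)
print(afa_pair)
```

The other candidate pairs are not listed in the sample, so the 89.35% figure cannot be reproduced from this comment alone.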

mpresteg commented 7 years ago

This essentially merges some of the information from the current linkages.log and haplotypePairs.log output into a summary file. If this is done, would you still have a use for the separate files (linkages.log and haplotypePairs.log vs. the new summary.log)? The answer will affect how I implement it.

kosoegawa commented 7 years ago

It is still good to keep the linkages.log file as it is, because it contains all known haplotypes, including ones that may not show up in haplotypePairs.log or the summary. haplotypePairs.log could be replaced by the summary, because the summary would contain all the information in haplotypePairs.log.

My proposal is:

Keep all the files as they are.

Generate one summary report in XML format.
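As a rough illustration of what such an XML summary could look like, here is a sketch using Python's standard `xml.etree.ElementTree`. Every element and attribute name (`summary`, `haplotypePair`, `population`, `pairFrequency`, `haplotypeFrequency`) is hypothetical and not taken from the tool; the numbers are the AFA values from the first comment, with the haplotypes written in standard HLA nomenclature.

```python
import xml.etree.ElementTree as ET


def build_summary(loci, haplotypes, populations):
    """Build a summary tree; all element and attribute names are hypothetical."""
    root = ET.Element("summary")
    pair = ET.SubElement(root, "haplotypePair", loci=",".join(loci))
    for hap_id, hap in haplotypes.items():
        ET.SubElement(pair, "haplotype", id=hap_id).text = hap
    for pop in populations:
        pop_el = ET.SubElement(
            pair, "population", category=pop["category"], code=pop["code"]
        )
        freq_el = ET.SubElement(
            pop_el, "pairFrequency", relativePercent=str(pop["relative_percent"])
        )
        freq_el.text = repr(pop["pair_freq"])
        # One child per haplotype, carrying its own frequency and rank.
        for hap_id, (freq, rank) in pop["haplotypes"].items():
            hap_el = ET.SubElement(
                pop_el, "haplotypeFrequency", ref=hap_id, rank=str(rank)
            )
            hap_el.text = repr(freq)
    return root


summary = build_summary(
    loci=["HLA-A", "HLA-C", "HLA-B", "HLA-DRB1", "HLA-DQB1"],
    haplotypes={
        "Hap1": "HLA-A*02:01g~HLA-C*03:03g~HLA-B*55:01~HLA-DRB1*13:01~HLA-DQB1*06:03g",
        "Hap2": "HLA-A*33:03g~HLA-C*03:02~HLA-B*58:01g~HLA-DRB1*13:02~HLA-DQB1*06:09",
    },
    populations=[
        {
            "category": "Broad",
            "code": "AFA",
            "pair_freq": 9.072092601647355e-9,
            "relative_percent": 89.35,
            "haplotypes": {
                "Hap1": (1.439175106e-5, 8771),
                "Hap2": (6.303675323333e-4, 177),
            },
        }
    ],
)
xml_text = ET.tostring(summary, encoding="unicode")
print(xml_text)
```

Keeping one `population` element per race code would let the existing log files stay unchanged while the summary gathers the pair and its frequencies in one place.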

mpresteg commented 7 years ago

I've made some progress on creating a summary output in XML. One difficulty is that the Wikiversity frequencies are structured differently from the other sets we work with, and the summary has to remain compatible with them. Notably, they are not strictly organized by race, and the frequency is sometimes blank. As a consequence, the software has infrastructure that reports the Wikiversity data as strings / notes rather than as numeric frequencies (doubles and floats) that can be used in a relative frequency computation.

  1. I'd like to confirm that we want to continue supporting the Wikiversity frequencies.
  2. If so, is it possible to organize them similarly to the other frequencies we use? To do so, we might need to:
     (a) Assign a 'race' category of "Unspecified" or something similar
     (b) Define a strategy for dealing with rows where the frequency is blank (ignore or other?)
     (c) Determine whether reporting only to the second position is reasonable (as is the case with the other frequencies in use)

Depending upon your answers to these questions, the XML may take different (or multiple) forms to account for the different structure of the frequencies.
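To make options (a) through (c) concrete, here is a small Python sketch of one possible normalization. The row layout (a haplotype string plus a raw frequency string) and both function names are assumptions made for illustration; the actual Wikiversity data files may be organized differently.

```python
def truncate_to_two_fields(allele):
    """Keep the first two colon-separated fields of an allele name.

    Example: 'HLA-A*01:01:01:01' becomes 'HLA-A*01:01'. Note that a
    trailing expression suffix on a longer name would be dropped too.
    """
    return ":".join(allele.split(":")[:2])


def normalize_rows(rows, race="Unspecified"):
    """Convert (haplotype, raw_frequency) rows into (race, haplotype, float).

    (a) every row gets the same placeholder race category,
    (b) rows with a blank or missing frequency are ignored,
    (c) alleles are reported to the second field only.
    """
    normalized = []
    for haplotype, raw_freq in rows:
        raw_freq = (raw_freq or "").strip()
        if not raw_freq:
            continue  # option (b): skip rows without a usable frequency
        alleles = [truncate_to_two_fields(a) for a in haplotype.split("~")]
        normalized.append((race, "~".join(alleles), float(raw_freq)))
    return normalized


# Hypothetical rows, not real Wikiversity data.
rows = [
    ("HLA-A*01:01:01:01~HLA-B*08:01:01", "0.05"),
    ("HLA-A*02:01~HLA-B*07:02", ""),
    ("HLA-A*03:01~HLA-B*07:02", None),
]
print(normalize_rows(rows))
```

Skipping blank rows keeps every remaining frequency numeric, so the same relative frequency computation could apply; whether dropping them is acceptable is exactly the open question in (b).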

kosoegawa commented 7 years ago

It would be good to have the summary XML for the NMDP frequencies. We do not need a summary XML for Wikiversity for now. We should implement full four-field frequencies when they become available.

kosoegawa commented 7 years ago

It would be nice if the software could take XML/HML as an initial input file and parse the sample GL Strings from it.
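A minimal Python sketch of that parsing step, under two assumptions: that the GL Strings sit in elements whose local name is `glstring`, and that only the `^` (locus), `+` (gene copy) and `/` (allele ambiguity) delimiters need handling. The `|` and `~` delimiters are ignored here, and the sample document below is hypothetical and not schema-valid HML.

```python
import xml.etree.ElementTree as ET


def extract_glstrings(xml_text):
    """Return the text of every element named 'glstring', in any namespace."""
    root = ET.fromstring(xml_text)
    found = []
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]  # strip '{namespace}' prefix
        if local_name == "glstring" and element.text and element.text.strip():
            found.append(element.text.strip())
    return found


def split_glstring(glstring):
    """Split into loci ('^'), gene copies ('+') and ambiguous alleles ('/')."""
    return [
        [copy.split("/") for copy in locus.split("+")]
        for locus in glstring.split("^")
    ]


# Hypothetical input; the namespace and structure are placeholders.
sample_xml = (
    '<hml xmlns="http://example.org/hypothetical">'
    '<sample id="S1"><typing><allele-assignment>'
    "<glstring> HLA-A*02:01+HLA-A*33:03^HLA-B*55:01/HLA-B*55:02+HLA-B*58:01 </glstring>"
    "</allele-assignment></typing></sample></hml>"
)
for gl in extract_glstrings(sample_xml):
    print(split_glstring(gl))
```

Matching on the local name avoids hard-coding a schema version, at the cost of accepting any element that happens to be called `glstring`.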

mpresteg commented 7 years ago

XML output is now available. HML as an input will be created as a new issue.