rwdavies / STITCH

STITCH - Sequencing To Imputation Through Constructing Haplotypes
http://www.nature.com/ng/journal/v48/n8/abs/ng.3594.html
GNU General Public License v3.0

Can STITCH be a free alternative to Gencove for low-coverage WGS? #29

Closed cwarden45 closed 4 years ago

cwarden45 commented 4 years ago

Hi,

I am trying to test what is the minimum read coverage that can still be considered identifying information for a sample, starting by testing my own data.

I could successfully identify my own sample with the GENCOVE imputed variants for ~0.5x lcWGS from Nebula. You can see some more details in the "Kinship / Identity-By-Descent (Close Family Relationships)" section of this blog post.

I also had to sign a HIPAA release to get my raw genomics data (~15,000 reads) from Color. So, I am trying to see if I can actually identify my own sample with that amount of data, or if I should use something in-between as a minimum read coverage for the data to be considered identifying information. My goal is to run re-analysis locally. However, if it helps, all of that data is already publicly available on my PGP page.

I apologize, but I am currently trying to make sure that I understand the input files for STITCH.

For example, it looks like I need a .bam alignment rather than a .fastq file. This should be OK, and I plan on aligning the reads to hg19 with BWA-MEM. However, if you think there is anything that I need to know about the alignment step, please let me know.

Other than that, it looks like I run STITCH within R. As far as I can tell, the STITCH R package has been installed successfully.

I am looking at the example code, and I have downloaded the human demo files.

However, I am still not completely sure what I need to do to get imputed genotypes for my own sample.

For example, it looks like I am supposed to provide a table of genotypes, but my goal is to be able to estimate the genotypes (at some level of accuracy that is probably noticeably lower than regular variant calling).

To be fair, I don't want to include myself in the reference panel, but I am of mostly European ancestry.

I apologize for the long set of background information, but these are essentially my questions:

1) Can I use STITCH to impute genotypes without any initial genotyping (such as from the .bam file, without any starting variant information)?

2) Am I correctly understanding that the gen_sequencing.txt file is a table of high-confidence genotypes (with "NA" for missing values that I want to impute)? If so, is it possible to impute any variants without having a starting set of higher confidence values?

3) Assuming everything else is OK, should I be trying to follow "Human example 3 - Run with reference panel with no updating", with a change in the reference panel to EUR populations? For example, could I just set reference_populations = c("CEU")?

Thank you very much for your assistance!

Sincerely, Charles

rwdavies commented 4 years ago

Hi Charles,

Responding to your initial message first

As for your points, it's probably easiest to answer them at once.

First, the ideal use case for STITCH is a large-ish population of samples (say >1000) without a reference panel, where STITCH iteratively i) builds an internal reference panel, and ii) imputes each sample from that internal reference panel; that is where most of its benefit comes from. STITCH can still work in your setting, where you have a single sample (you) that you want to impute, following the guidance of Human Example 3: you basically give STITCH a reference panel (for example 1000 Genomes) and tell it not to do the updating, just to impute you from the reference panel at exactly the reference panel sites.

As for point 1 above, "can I use STITCH to impute variants without any initial genotyping", yes: you would have your .bam, you would have a list of positions + haplotypes from say 1000 Genomes, and you would follow a setup like Human Example 3, and in that sense not do any genotyping.

For point 2 above, gen_sequencing.txt is an optional file, which is useful for internal estimation of accuracy. It is particularly useful if, for example, you are imputing many thousands of samples and have truth data on some subset; then, for each iteration of the normal STITCH procedure, you'll get a measure of accuracy. This can be useful if you want to choose between, say, 20 EM iterations and 100, or different heuristics. In your case it is not relevant, as if you had truth data, you wouldn't be doing your imputation! (Or I guess you could have truth data and just want to assess this in general.) In any case, you don't need to worry about gen_sequencing.txt; it is optional.

Hope that helps, Robbie

cwarden45 commented 4 years ago

Hi Robbie - thank you very much for your prompt response!

Yes - that does seem like a very small number of reads. It is possible that Color was just being extra strict when making me sign a HIPAA release to get that data, even though I think it is less than most other companies provide (at least among those that provided any raw data).

However, I am still interested in looking into this for a couple reasons:

1) Color has at least one paper about using lcWGS for Polygenic Risk Scores, and you can see one such pre-print with a comment from myself here. So, if this represents their lcWGS data, then having more difficulty even being able to identify myself (and comparing variant concordance) may help make my point more clearly.

2) In terms of the FASTQ files being identifying information (for privacy concerns), I don't actually need variants at any given position to be reliable. I think it would even be enough to get a relatively high relatedness score for my own very low coverage WGS sample, with a noticeable drop from higher coverage data (as long as the relatedness is still noticeably higher than a parent or sibling relationship). That said, even if this doesn't work, I think the room for future developments, and the precedent of other studies that needed either to have human reads filtered out or to be controlled access, is sufficient reason to be extra careful about making FASTQ data public (if you don't have "explicit consent" from the patient to do that).

In terms of generally using myself as a positive, independent control, I have done that with the much higher lcWGS data from Nebula. You can see the problems in this blog post, and the relatedness / ancestry estimates in this blog post.

There were some slight drops in performance for relatedness, and the accuracy of any variant dropped down to ~90% for the lcWGS sites (compared to >95% concordance for Exome versus regular WGS SNPs). However, I think the accuracy can drop a decent bit and still be able to use the FASTQ files to identify myself versus others. For example, to throw out a number, maybe I can still find myself if the imputed SNPs have 70% accuracy?

I might need to take some time to understand things a little better, but it sounds like STITCH may be able to help me (even though it was designed for a different purpose). I have also asked for the ability to upload FASTQ files for Gencove imputation (from scratch), but I haven't heard back about that yet (even though I would guess that may be more sensitive than looking at the variable sites between populations with 1 read, ignoring all rare variant positions, assuming each such site is heterozygous, and testing something based more upon a SNP chip imputation method). If I understand what you are saying, it sounds like I can just provide the .bam file (and I don't even need the modification I described, since the gen_sequencing.txt file is optional).

Thank you again!

cwarden45 commented 4 years ago

In other words, my goal is to compare lcWGS data (with read counts varying from my Color sample to my Nebula sample), with reference sample genotypes defined another way (either from whatever you use in the demo datasets, or I think I can create the necessary file from the 1000 Genomes Omni SNP chip data).

I also know that I can identify myself with ~10,000 SNP chip probes. So, if I continue to reduce a chip-to-chip comparison, then that can probably serve as a positive control.

The imputation may not be needed for comparing lcWGS-to-regularWGS (and/or SNP chip), but I think it may be needed for lcWGS-to-lcWGS. If that works worse, then that also helps justify my concern about using lcWGS for discovery purposes (although I don't think there are currently many companies that try to do that - for example, Nebula currently offers regular WGS instead of lcWGS sequencing).

I think you have explained what I need to know to get started, so thank you again!

rwdavies commented 4 years ago

Quick naive 10pm sleepy question on my end - for clarity, what is your ultimate interest? Going back to your first sentence, "I am trying to test what is the minimum read coverage that can still be considered identifying information for a sample, starting by testing my own data.", can you formalize this? Is this "What is the minimum read coverage using lc-WGS that would allow me to uniquely identify a sample (myself) from among a target population of 7 billion people sequenced to 30X using WGS?"

cwarden45 commented 4 years ago

My downstream goal is to calculate Kinship measures (for genetic relatedness between samples).

Right now, they are pretty similar to a self-match (for your own sample, or a monozygotic twin sample). If I start to see things about half-way between the maximum parent-to-child relationship in 1000 Genomes, then I might have to worry about an increased range of Kinship values with more samples. However, I currently don't know what that looks like (or whether I can just check for recovery with a Kinship value of >0.45, without another tier of noisy relatedness estimates).

I will also be comparing reduction of sites used for that Kinship estimate, so I think that should also help capture some of the variability in the parent-child relationships.

While the exact values are a little lower (and the accuracy for individual sites is sufficiently worse that I would consider it unacceptable in a clinical setting), I would say my ~0.5x lcWGS sample from Nebula (with provided Gencove imputations) is visually indistinguishable from other pairs of comparisons for my own samples (which are called "New" in the sense that they are newer than the 1000 Genomes samples):

[Figure: Genetic Distance]

Those "Self/New" data points above are for 1) Veritas WGS, 2) 23andMe SNP chip, 3) genes for Good SNP chip, and 4) Nebula lcWGS (along with the matching positions from the 1000 Genomes Omni SNP chip).

I believe those Color lcWGS values were used for my ancestry estimate, which said that I was 100% European. This misses a 2-3% African ancestry that I can see from other results / methods. However, if dropping to ~15,000 reads only causes a drop of ~5% in the Kinship value, then I would say that is pretty clear self-identification (justifying calling the FASTQ files identifiable information, meaning Color was right to call that PHI under HIPAA).

That said, I would not consider a negative result to mean these things shouldn't be taken seriously. For example, the methods may not be ideal and/or there may be privacy concerns beyond self-identification. Plus, part of it is just me wanting to see what happens when I use fewer reads for my own lcWGS data.

Thank you for being interested enough in this to think about it more!

UPDATE (2/29/2020): If anyone is curious, I tested down-sampling the 1000 Genomes samples (and my own samples, with the higher ~0.5x Nebula lcWGS) and you can see that here. I think the question now is how much lower is the imputation accuracy with fewer read counts for the lcWGS (and can I still get ~1,000 variants or more).

rwdavies commented 4 years ago

One random thought - if you have lcWGS, and want to see if it came from a given diploid sample, can't you calculate something like

P(read | comes from person i) = P(read | person i hap 1) P(read comes from hap 1) + P(read | person i hap 2) P(read comes from hap 2)

given phasing, though the phasing need not be very accurate, or just consider a single SNP for each read (at most).

Then, if there are few enough reads that they are in linkage equilibrium, you get something like

P(data | comes from person i) = \prod_{i_read} P(read i_read | comes from person i)

and then, where model i is "the data comes from person i",

P(model i | data) = P(data | model i) P(model i) / sum over models of the same thing

where I suppose you need a model 0 of "not in the dataset" and some prior on model 0 (but the data should overwhelm this most of the time?). The hard part is P(data | model 0), but I think, with reads intersecting SNPs in LE, this can just be done using allele frequency.

Just a thought! Should work though. And would obviate need to use any software packages other than fastq -> bam -> GATK UG or samtools. Then just needs some reference dataset you want to compare against, plus an allele frequency for some population
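As a rough illustration of this idea (not STITCH code, and not from the thread): a sketch that collapses the haplotype mixture above into genotype dosages, assuming one biallelic SNP per read, linkage equilibrium, and a made-up sequencing error rate. All names and values here are hypothetical.

```r
eps <- 0.01  # assumed per-base error rate (made up for illustration)

# P(observed allele | genotype dosage g in {0, 1, 2} alt alleles):
# a random haplotype from the person carries alt with probability g/2,
# then flip with probability eps for sequencing error.
p_read_given_geno <- function(obs_alt, g) {
  p_alt <- g / 2
  p_obs_alt <- p_alt * (1 - eps) + (1 - p_alt) * eps
  ifelse(obs_alt == 1, p_obs_alt, 1 - p_obs_alt)
}

# log P(data | person i): product over reads, valid under LE
loglik_person <- function(obs_alt, genos) {
  sum(log(p_read_given_geno(obs_alt, genos)))
}

# log P(data | model 0, "not in the dataset"), using population
# allele frequencies for the SNP each read covers
loglik_background <- function(obs_alt, af) {
  p_obs_alt <- af * (1 - eps) + (1 - af) * eps
  sum(log(ifelse(obs_alt == 1, p_obs_alt, 1 - p_obs_alt)))
}

# Toy data: 5 reads, each hitting one SNP
obs_alt <- c(1, 0, 1, 1, 0)            # alt allele observed on each read?
genos   <- c(2, 0, 1, 2, 0)            # candidate person's genotypes
af      <- c(0.3, 0.4, 0.5, 0.2, 0.6)  # population allele frequencies

# Log likelihood ratio: positive favours "reads came from this person"
llr <- loglik_person(obs_alt, genos) - loglik_background(obs_alt, af)
```

With matching genotypes (as in the toy data), llr comes out positive; with a random person's genotypes it tends negative as the read count grows.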

cwarden45 commented 4 years ago

Thank you for the feedback!

I think plink is calculating the kinship/relatedness without me needing to use something like SHAPEIT upstream (and it is a lot faster than SHAPEIT). However, I am sure there are other situations where more precise phasing is needed.

I actually still haven't heard back from Gencove about simply uploading the FASTQ files and getting imputed variants. However, I will probably have to wait until this weekend to look into this more anyways. I think they are doing extra steps, but I am trying to see what I can do with open-source programs (even though I am not opposed to temporarily paying for testing, but the request for premium features like uploading FASTQ files is what I haven't heard back about).

I think I have seen something about only allowing one mismatch per read for some analysis. I tried to quickly look back at the STITCH paper (in case that is where I read that), but maybe I read it somewhere else (or I am overlooking something, in which case I apologize).

rwdavies commented 4 years ago

The one mismatch per read might have come from a number of sources, particularly mapping or assembly type approaches. Fine for Illumina (most of the time), but less so for more error prone methods, or if the assembly is a bit worse. In the blurb I wrote above, one can easily handle more mismatches, using the math laid out in STITCH, which is not especially novel in that context - it is the obvious way to make a read contribute to the likelihood

Good luck with the rest of your analyses, and let me know if you have other questions

cwarden45 commented 4 years ago

Hi Robert - I can start a new ticket/issue (if it helps), but I don't think I quite understand the alternative command that I need to use as a modified version of human example 3:

STITCH(
  bamlist = list(input_bam),
  sampleNames_file = paste(sampleID,"-names.txt",sep=""),
  outputdir = sampleID,
  method = "diploid",
  originalRegionName = "20.1000000.1100000",
  regenerateInput = FALSE,
  regionStart = 1000000,
  regionEnd = 1100000,
  buffer = 10000,
  niterations = 1,
  chr = "chr20",
  inputBundleBlockSize = 100,
  reference_populations = c("CEU", "GBR", "ACB"),
  reference_haplotype_file = human_matched_to_reference_reference_haplotype_file,
  reference_sample_file = human_reference_sample_file,
  reference_legend_file = human_matched_to_reference_reference_legend_file,
  posfile = human_posfile,
  shuffleHaplotypeIterations = NA,
  refillIterations = NA,
  K = human_K, tempdir = "temp", nCores = detectCores(), nGen = human_nGen)

I also have that code uploaded with these notes, which I plan to update when I have a working version.

I think I will also need to make changes and/or have more questions to cover all of my chromosomes as well as fix a difference in the chromosome formatting.

However, I don't think I currently have the right set of input parameters because I am getting this error:

Error in validate_bamlist_and_cramlist_for_input_generation(regenerateInput,  :
  If not regenerating input, please do not supply bamlist
Calls: STITCH -> validate_bamlist_and_cramlist_for_input_generation

Can you please help me troubleshoot?

cwarden45 commented 4 years ago

Sorry - I think I figured out part of the problem is that I needed to set regenerateInput = TRUE.

It also looks like I needed to just provide the input .bam file name (as a string, rather than a list):

Error: bamlist = list(input_bam)
No Error: bamlist = input_bam

I can also successfully get STITCH to start running after changing the following files (from "20" to "chr20"):

STITCH_human_example_2016_10_18/pos.txt --> STITCH_human_example_2016_10_18/pos-ALT.txt

STITCH_human_reference_example_2018_07_11/pos.intersect.txt --> STITCH_human_reference_example_2018_07_11/pos.intersect-ALT.txt
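The "20" to "chr20" edit described above can be scripted; here is a sketch, assuming the pos file is tab-separated with the chromosome in the first column and no header (the -ALT output name just mirrors the convention above):

```r
# Sketch: prefix "chr" onto the chromosome column of a STITCH pos file.
# Assumes tab-separated columns (chr, pos, ref, alt) with no header.
pos <- read.table("STITCH_human_example_2016_10_18/pos.txt",
                  stringsAsFactors = FALSE)
pos[, 1] <- paste0("chr", pos[, 1])
write.table(pos, "STITCH_human_example_2016_10_18/pos-ALT.txt",
            quote = FALSE, sep = "\t",
            row.names = FALSE, col.names = FALSE)
```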

However, I then get the following error message:

Error in get_sample_names(bamlist = bamlist, cramlist = cramlist, nCores = nCores,  :
  There are 1 sample names according to the sampleNames_file but there are 2254268 according to the bam / cram list
Calls: STITCH -> get_sample_names
In addition: There were 50 or more warnings (use warnings() to see the first 50)

I think the issue may be that I am supposed to provide a file with the .bam file path in it (instead of an R object with the .bam file name). However, when I change that I also get an error message:

[2020-03-01 15:20:18] Program start
[2020-03-01 15:20:18] Get and validate pos and gen
[2020-03-01 15:20:18] Done get and validate pos and gen
[2020-03-01 15:20:18] There are 89 variants in the left buffer region 990000 <= position < 1000000
[2020-03-01 15:20:18] There are 858 variants in the central region 1000000 <= position <= 1100000
[2020-03-01 15:20:18] There are 67 variants in the right buffer region 1100000 < position <= 1110000
Error in x3[[i_core]] : subscript out of bounds
Calls: STITCH ... get_bundling_position_information -> lapply -> FUN -> getOutputBlockRange
In addition: Warning messages:
1: In readLines(bamlist) :
  incomplete final line found on 'Nebula_full_nodup-bams.txt'
2: In read.table(sampleNames_file, stringsAsFactors = FALSE) :
  incomplete final line found by readTableHeader on 'Nebula_full_nodup-names.txt'

This is the current command:

STITCH(
  bamlist = paste(sampleID,"-bams.txt",sep=""),#I think this is supposed to be file with a list of .bams, not 1 bam file (or a "list" of .bam files)
  sampleNames_file = paste(sampleID,"-names.txt",sep=""),
  outputdir = sampleID,
  method = "diploid",
  originalRegionName = "20.1000000.1100000",
  regenerateInput = TRUE,
  regionStart = 1000000,
  regionEnd = 1100000,
  buffer = 10000,
  niterations = 1,
  chr = "chr20",
  inputBundleBlockSize = 100,
  reference_populations = c("CEU", "GBR", "ACB"),
  reference_haplotype_file = human_matched_to_reference_reference_haplotype_file,
  reference_sample_file = human_reference_sample_file,
  reference_legend_file = human_matched_to_reference_reference_legend_file,
  posfile = human_posfile,
  shuffleHaplotypeIterations = NA,
  refillIterations = NA,
  K = human_K, tempdir = "temp", nCores = detectCores(), nGen = human_nGen)

I apologize for all of the updates (as I found the solution for some problems), but can you please still help me troubleshoot?

cwarden45 commented 4 years ago

To give an update, I have found a way to get STITCH running without any error (with a limited number of low coverage .bam files), with details that can be found here.

Perhaps unsurprisingly, the performance in that troubleshooting scenario is not very good:

[Figure: estimated genotype recovery]

So, I would like to continue trying to get this code to work.

If I revise the code to use something that was successful for STITCH under a different scenario:

sampleID = "Nebula_full_nodup"
output_prefix = "Nebula_full_nodup"
pos_folder = "human_g1k_v37-pos_files"

library("STITCH")

#human_genfile = "gen_sequencing.txt" #this is supposed to be optional
human_K = 10
human_nGen = 4 * 20000 / human_K
temp_folder = tempdir()

human_posfile = "STITCH_human_example_2016_10_18/pos.txt"
human_reference_sample_file = "1000GP_Phase3.sample"
human_reference_legend_file =  "1000GP_Phase3_chr20.legend.gz"
human_reference_haplotype_file = "1000GP_Phase3_chr20.hap.gz"

human_matched_to_reference_genfile = "STITCH_human_reference_example_2018_07_11/gen_sequencing.intersect.txt"#also optional?
human_matched_to_reference_posfile = "STITCH_human_reference_example_2018_07_11/pos.intersect.txt"
human_matched_to_reference_reference_legend_file = "STITCH_human_reference_example_2018_07_11/1000GP_Phase3_20.1000000.1100000.legend.gz"
human_matched_to_reference_reference_haplotype_file = "STITCH_human_reference_example_2018_07_11/1000GP_Phase3_20.1000000.1100000.hap.gz"

#try to follow human example 3: https://github.com/rwdavies/STITCH/blob/master/examples/examples.R

STITCH(
  bamlist = paste(sampleID,"-bams.txt",sep=""),
  sampleNames_file = paste(sampleID,"-names.txt",sep=""),
  outputdir = sampleID,
  method = "diploid",
  originalRegionName = "20.1000000.1100000",
  regenerateInput = TRUE,
  regionStart = 1000000,
  regionEnd = 1100000,
  buffer = 10000,
  niterations = 1,
  chr = "20",
  inputBundleBlockSize = 100,
  reference_populations = c("CEU", "GBR", "ACB"),
  reference_haplotype_file = human_matched_to_reference_reference_haplotype_file,
  reference_sample_file = human_reference_sample_file,
  reference_legend_file = human_matched_to_reference_reference_legend_file,
  posfile = human_posfile,
  shuffleHaplotypeIterations = NA,
  refillIterations = NA,
  K = human_K, tempdir = temp_folder, nCores = detectCores(), nGen = human_nGen)

then I get the following error message:

Loading required package: parallel
Loading required package: rrbgen
[2020-03-17 17:12:29] Running STITCH(chr = 20, nGen = 8000, posfile = STITCH_human_example_2016_10_18/pos.txt, K = 10, S = 1, outputdir = Nebula_full_nodup, nStarts = , tempdir = /tmp/RtmpKwkQMY, bamlist = Nebula_full_nodup-bams.txt, cramlist = , sampleNames_file = Nebula_full_nodup-names.txt, reference = , genfile = , method = diploid, output_format = bgvcf, B_bit_prob = 16, outputInputInVCFFormat = FALSE, downsampleToCov = 50, downsampleFraction = 1, readAware = TRUE, chrStart = NA, chrEnd = NA, regionStart = 1e+06, regionEnd = 1100000, buffer = 10000, maxDifferenceBetweenReads = 1000, maxEmissionMatrixDifference = 1e+10, alphaMatThreshold = 1e-04, emissionThreshold = 1e-04, iSizeUpperLimit = 600, bqFilter = 17, niterations = 1, shuffleHaplotypeIterations = NA, splitReadIterations = 25, nCores = 4, expRate = 0.5, maxRate = 100, minRate = 0.1, Jmax = 1000, regenerateInput = TRUE, originalRegionName = 20.1000000.1100000, keepInterimFiles = FALSE, keepTempDir = FALSE, outputHaplotypeProbabilities = FALSE, switchModelIteration = NA, generateInputOnly = FALSE, restartIterations = NA, refillIterations = NA, downsampleSamples = 1, downsampleSamplesKeepList = NA, subsetSNPsfile = NA, useSoftClippedBases = FALSE, outputBlockSize = 1000, outputSNPBlockSize = 10000, inputBundleBlockSize = 100, genetic_map_file = , reference_haplotype_file = STITCH_human_reference_example_2018_07_11/1000GP_Phase3_20.1000000.1100000.hap.gz, reference_legend_file = STITCH_human_reference_example_2018_07_11/1000GP_Phase3_20.1000000.1100000.legend.gz, reference_sample_file = 1000GP_Phase3.sample, reference_populations = c("CEU", "GBR", "ACB"), reference_phred = 20, reference_iterations = 40, reference_shuffleHaplotypeIterations = c(4, 8, 12, 16), output_filename = NULL, initial_min_hapProb = 0.2, initial_max_hapProb = 0.8, regenerateInputWithDefaultValues = FALSE, plotHapSumDuringIterations = FALSE, plot_shuffle_haplotype_attempts = FALSE, plotAfterImputation = TRUE, save_sampleReadsInfo = 
FALSE, gridWindowSize = NA, shuffle_bin_nSNPs = NULL, shuffle_bin_radius = 5000, keepSampleReadsInRAM = FALSE, useTempdirWhileWriting = FALSE, output_haplotype_dosages = FALSE)
[2020-03-17 17:12:29] Program start
[2020-03-17 17:12:29] Get and validate pos and gen
[2020-03-17 17:12:29] Done get and validate pos and gen
[2020-03-17 17:12:29] There are 89 variants in the left buffer region 990000 <= position < 1000000
[2020-03-17 17:12:29] There are 858 variants in the central region 1000000 <= position <= 1100000
[2020-03-17 17:12:29] There are 67 variants in the right buffer region 1100000 < position <= 1110000
Error in x3[[i_core]] : subscript out of bounds
Calls: STITCH ... get_bundling_position_information -> lapply -> FUN -> getOutputBlockRange
In addition: Warning messages:
1: In readLines(bamlist) :
  incomplete final line found on 'Nebula_full_nodup-bams.txt'
2: In read.table(sampleNames_file, stringsAsFactors = FALSE) :
  incomplete final line found by readTableHeader on 'Nebula_full_nodup-names.txt'
Execution halted

Can you please help me to continue to troubleshoot?

I think this is essentially the same error message as before, but I at least can say that I have gotten STITCH to run without errors in another scenario. Please let me know if you need any additional information.

rwdavies commented 4 years ago

Hey,

Thanks for the messages. So for the immediately above error, there is a warning that R is not happy reading in the bamlist text file and the sampleNames file. I'm wondering if they are both being interpreted to have length 0? What happens in R if you just do

sampleNames <- read.table(paste(sampleID, "-names.txt", sep = ""), stringsAsFactors = FALSE)[, 1]
print(sampleNames)

Can you check the end of line character in those files and that it makes sense?

Thanks, Robbie
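The "incomplete final line" warnings in the earlier logs suggest the list files are missing a trailing newline. One way to check and repair that (a sketch, not part of STITCH; the helper name is made up):

```r
# Append a final newline to a text file if its last byte isn't one,
# so readLines()/read.table() stop warning about an incomplete line.
fix_final_newline <- function(path) {
  sz <- file.info(path)$size
  if (is.na(sz) || sz == 0) return(invisible(NULL))
  con <- file(path, open = "rb")
  seek(con, where = sz - 1)
  last <- readBin(con, what = "raw", n = 1)  # read the final byte
  close(con)
  if (last != charToRaw("\n")) {
    cat("\n", file = path, append = TRUE)
  }
  invisible(NULL)
}

fix_final_newline("Nebula_full_nodup-bams.txt")
fix_final_newline("Nebula_full_nodup-names.txt")
```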

cwarden45 commented 4 years ago

Hi Robbie - thank you for your feedback.

There is only actually 1 .bam file (for myself) in those files:

Nebula_full_nodup-bams.txt

../1000_Genomes_BAMs/CW_Nebula-provided.bam

Nebula_full_nodup-names.txt

Nebula_full_nodup

I cleaned up the code a bit (and added trailing newlines to the above files) to make things easier to read:

bam_list = "Nebula_full_nodup-bams.txt"
name_list = "Nebula_full_nodup-names.txt"
output_prefix = "Nebula_full_nodup"
pos_folder = "human_g1k_v37-pos_files"

library("STITCH")

#human_genfile = "gen_sequencing.txt" #this is supposed to be optional
human_K = 10
human_nGen = 4 * 20000 / human_K
temp_folder = tempdir()

human_posfile = "STITCH_human_example_2016_10_18/pos.txt"
human_reference_sample_file = "1000GP_Phase3.sample"
human_reference_legend_file =  "1000GP_Phase3_chr20.legend.gz"
human_reference_haplotype_file = "1000GP_Phase3_chr20.hap.gz"

human_matched_to_reference_genfile = "STITCH_human_reference_example_2018_07_11/gen_sequencing.intersect.txt"#also optional?
human_matched_to_reference_posfile = "STITCH_human_reference_example_2018_07_11/pos.intersect.txt"
human_matched_to_reference_reference_legend_file = "STITCH_human_reference_example_2018_07_11/1000GP_Phase3_20.1000000.1100000.legend.gz"
human_matched_to_reference_reference_haplotype_file = "STITCH_human_reference_example_2018_07_11/1000GP_Phase3_20.1000000.1100000.hap.gz"

#try to follow human example 3: https://github.com/rwdavies/STITCH/blob/master/examples/examples.R

chr = 20
output_folder = paste(output_prefix,"_",chr,sep="")

STITCH(
  bamlist = bam_list,
  sampleNames_file = name_list,
  outputdir = output_folder,
  method = "diploid",
  originalRegionName = "20.1000000.1100000",
  regenerateInput = TRUE,
  regionStart = 1000000,
  regionEnd = 1100000,
  buffer = 10000,
  niterations = 1,
  chr = chr,
  inputBundleBlockSize = 100,
  reference_populations = c("CEU", "GBR", "ACB"),
  reference_haplotype_file = human_matched_to_reference_reference_haplotype_file,
  reference_sample_file = human_reference_sample_file,
  reference_legend_file = human_matched_to_reference_reference_legend_file,
  posfile = human_posfile,
  shuffleHaplotypeIterations = NA,
  refillIterations = NA,
  K = human_K, tempdir = temp_folder, nCores = detectCores(), nGen = human_nGen)

However, I get the same error message:

Loading required package: parallel
Loading required package: rrbgen
[2020-03-18 08:55:29] Running STITCH(chr = 20, nGen = 8000, posfile = STITCH_human_example_2016_10_18/pos.txt, K = 10, S = 1, outputdir = Nebula_full_nodup_20, nStarts = , tempdir = /tmp/RtmpxPWisF, bamlist = Nebula_full_nodup-bams.txt, cramlist = , sampleNames_file = Nebula_full_nodup-names.txt, reference = , genfile = , method = diploid, output_format = bgvcf, B_bit_prob = 16, outputInputInVCFFormat = FALSE, downsampleToCov = 50, downsampleFraction = 1, readAware = TRUE, chrStart = NA, chrEnd = NA, regionStart = 1e+06, regionEnd = 1100000, buffer = 10000, maxDifferenceBetweenReads = 1000, maxEmissionMatrixDifference = 1e+10, alphaMatThreshold = 1e-04, emissionThreshold = 1e-04, iSizeUpperLimit = 600, bqFilter = 17, niterations = 1, shuffleHaplotypeIterations = NA, splitReadIterations = 25, nCores = 4, expRate = 0.5, maxRate = 100, minRate = 0.1, Jmax = 1000, regenerateInput = TRUE, originalRegionName = 20.1000000.1100000, keepInterimFiles = FALSE, keepTempDir = FALSE, outputHaplotypeProbabilities = FALSE, switchModelIteration = NA, generateInputOnly = FALSE, restartIterations = NA, refillIterations = NA, downsampleSamples = 1, downsampleSamplesKeepList = NA, subsetSNPsfile = NA, useSoftClippedBases = FALSE, outputBlockSize = 1000, outputSNPBlockSize = 10000, inputBundleBlockSize = 100, genetic_map_file = , reference_haplotype_file = STITCH_human_reference_example_2018_07_11/1000GP_Phase3_20.1000000.1100000.hap.gz, reference_legend_file = STITCH_human_reference_example_2018_07_11/1000GP_Phase3_20.1000000.1100000.legend.gz, reference_sample_file = 1000GP_Phase3.sample, reference_populations = c("CEU", "GBR", "ACB"), reference_phred = 20, reference_iterations = 40, reference_shuffleHaplotypeIterations = c(4, 8, 12, 16), output_filename = NULL, initial_min_hapProb = 0.2, initial_max_hapProb = 0.8, regenerateInputWithDefaultValues = FALSE, plotHapSumDuringIterations = FALSE, plot_shuffle_haplotype_attempts = FALSE, plotAfterImputation = TRUE, save_sampleReadsInfo = 
FALSE, gridWindowSize = NA, shuffle_bin_nSNPs = NULL, shuffle_bin_radius = 5000, keepSampleReadsInRAM = FALSE, useTempdirWhileWriting = FALSE, output_haplotype_dosages = FALSE)
[2020-03-18 08:55:29] Program start
[2020-03-18 08:55:29] Get and validate pos and gen
[2020-03-18 08:55:29] Done get and validate pos and gen
[2020-03-18 08:55:29] There are 89 variants in the left buffer region 990000 <= position < 1000000
[2020-03-18 08:55:29] There are 858 variants in the central region 1000000 <= position <= 1100000
[2020-03-18 08:55:29] There are 67 variants in the right buffer region 1100000 < position <= 1110000
Error in x3[[i_core]] : subscript out of bounds
Calls: STITCH ... get_bundling_position_information -> lapply -> FUN -> getOutputBlockRange
Execution halted

Is the combination of .bam files (for my sample) and .txt files with genotypes (for the 1000 Genomes reference files) part of the problem?

In the examples, I thought I saw a 2-step way of running STITCH. Do you think that may help (if I can first work with my .bam file, and then compare the 1000 Genomes reference set)?

cwarden45 commented 4 years ago

I am not sure if it helps, but I get the same message if I only try to generate an input file from my .bam file:

bam_list = "Nebula_full_nodup-bams.txt"
name_list = "Nebula_full_nodup-names.txt"
output_prefix = "Nebula_full_nodup"

library("STITCH")

#human_genfile = "gen_sequencing.txt" #this is supposed to be optional
human_K = 10
human_nGen = 4 * 20000 / human_K
temp_folder = tempdir()

human_posfile = "STITCH_human_example_2016_10_18/pos.txt"

chr = 20
output_folder = paste(output_prefix,"_",chr,sep="")

STITCH(
  bamlist = bam_list,
  sampleNames_file = name_list,
  generateInputOnly = TRUE,
  outputdir = output_folder, method = "diploid",
  chr = chr, posfile = human_posfile,
  regionStart = 1000000,
  regionEnd = 1100000,
  buffer = 10000,
  inputBundleBlockSize = 100,
  K = human_K, tempdir = temp_folder, nCores = detectCores(), nGen = human_nGen)

Here is the error message for that part of the code:

Loading required package: parallel
Loading required package: rrbgen
[2020-03-18 09:06:46] Running STITCH(chr = 20, nGen = 8000, posfile = STITCH_human_example_2016_10_18/pos.txt, K = 10, S = 1, outputdir = Nebula_full_nodup_20, nStarts = , tempdir = /tmp/RtmpCA70XP, bamlist = Nebula_full_nodup-bams.txt, cramlist = , sampleNames_file = Nebula_full_nodup-names.txt, reference = , genfile = , method = diploid, output_format = bgvcf, B_bit_prob = 16, outputInputInVCFFormat = FALSE, downsampleToCov = 50, downsampleFraction = 1, readAware = TRUE, chrStart = NA, chrEnd = NA, regionStart = 1e+06, regionEnd = 1100000, buffer = 10000, maxDifferenceBetweenReads = 1000, maxEmissionMatrixDifference = 1e+10, alphaMatThreshold = 1e-04, emissionThreshold = 1e-04, iSizeUpperLimit = 600, bqFilter = 17, niterations = 40, shuffleHaplotypeIterations = c(4, 8, 12, 16), splitReadIterations = 25, nCores = 4, expRate = 0.5, maxRate = 100, minRate = 0.1, Jmax = 1000, regenerateInput = TRUE, originalRegionName = NA, keepInterimFiles = FALSE, keepTempDir = FALSE, outputHaplotypeProbabilities = FALSE, switchModelIteration = NA, generateInputOnly = TRUE, restartIterations = NA, refillIterations = c(6, 10, 14, 18), downsampleSamples = 1, downsampleSamplesKeepList = NA, subsetSNPsfile = NA, useSoftClippedBases = FALSE, outputBlockSize = 1000, outputSNPBlockSize = 10000, inputBundleBlockSize = 100, genetic_map_file = , reference_haplotype_file = , reference_legend_file = , reference_sample_file = , reference_populations = NA, reference_phred = 20, reference_iterations = 40, reference_shuffleHaplotypeIterations = c(4, 8, 12, 16), output_filename = NULL, initial_min_hapProb = 0.2, initial_max_hapProb = 0.8, regenerateInputWithDefaultValues = FALSE, plotHapSumDuringIterations = FALSE, plot_shuffle_haplotype_attempts = FALSE, plotAfterImputation = TRUE, save_sampleReadsInfo = FALSE, gridWindowSize = NA, shuffle_bin_nSNPs = NULL, shuffle_bin_radius = 5000, keepSampleReadsInRAM = FALSE, useTempdirWhileWriting = FALSE, output_haplotype_dosages = FALSE)
[2020-03-18 09:06:46] Program start
[2020-03-18 09:06:46] Get and validate pos and gen
[2020-03-18 09:06:46] Done get and validate pos and gen
[2020-03-18 09:06:46] There are 89 variants in the left buffer region 990000 <= position < 1000000
[2020-03-18 09:06:46] There are 858 variants in the central region 1000000 <= position <= 1100000
[2020-03-18 09:06:46] There are 67 variants in the right buffer region 1100000 < position <= 1110000
Error in x3[[i_core]] : subscript out of bounds
Calls: STITCH ... get_bundling_position_information -> lapply -> FUN -> getOutputBlockRange
Execution halted
cwarden45 commented 4 years ago

Oh - I am sorry. In between changing code, I somehow overlooked changing the number of cores to 1.

I will let you know if I can run the rest of the code (and then close the issue), but that error above could be solved by changing the following parameter:

nCores = 1

cwarden45 commented 4 years ago

I confirmed that I can generate a .vcf for the test region (which outputs imputed genotypes for just my sample). In addition to changing the number of cores, I also set niterations = 20 (since I needed to provide a value other than 1).

However, I now have a new problem: I want to run this across the full length of all 22 autosomal chromosomes.

Is there a place where I can download a file that is ready to run in STITCH for the reference_haplotype_file and reference_legend_file parameters (matching the current reference_sample_file)?

I already have the posfile for just comparing .bams, so I think that I only need larger/additional reference files.

rwdavies commented 4 years ago

Huh, about the nCores thing, I'll take a look. I can force STITCH to change that internally to nCores <- min(nCores, N), where N is the sample size.

If you don't include reference information, it will try to impute just you, using only your own data, which won't work (it may run, but will produce nonsensical output).

On the question "Is there a place where I can download a file that is ready to run in STITCH...": I wrote STITCH to use largely the same format as IMPUTE2, which has files here https://mathgen.stats.ox.ac.uk/impute/impute_v2.html#reference. They might need a bit of poking to be in exactly the same format. (This is a little outside the normal use case of STITCH, which is "impute many samples using no reference panel" rather than "impute one sample using a reference panel", which is why these files aren't readily curated by me.)
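For anyone fetching those files, a quick inspection sketch can confirm they look the way STITCH expects before wiring them in. Filenames here are hypothetical, and the layout assumed is the usual IMPUTE2 one: a `.legend` with a header line and columns `id position a0 a1`, a `.hap` with one space-separated 0/1 column per reference haplotype, and a `.sample` table:

```shell
# Peek at the legend header and first variants.
gunzip -c 1000GP_Phase3_chr20.legend.gz | head -3

# Count reference haplotypes (columns of the first .hap row).
gunzip -c 1000GP_Phase3_chr20.hap.gz | head -1 | tr ' ' '\n' | wc -l

# Check the sample/population table used for reference_populations.
head -3 1000GP_Phase3.sample
```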

cwarden45 commented 4 years ago

I see - thank you for your response.

I will see what I can do with those files. Otherwise, I will see what the results look like with that part of chr20.

If you have any ideas about how to improve performance (through parameter values) with the small set of low coverage .bam files (from this issue/question), please let me know. I also realize that may not be possible, but I want to try to represent the best STITCH results that I can produce on my local computer.

Since this seems like the limit of what I can do with the STITCH-provided files, I will close this issue.

Thank you again.

cwarden45 commented 4 years ago

Strictly speaking, STITCH runs without an error when I just provide the IMPUTE2 files. However, I think I see what you mean about providing the human_matched_to_reference_reference_haplotype_file instead of the human_reference_haplotype_file (from your example code).

The number of positions estimated within the test region for chr20 is much smaller. However, my test positions for my sample are different anyway (and I think the total number of variants across all full chromosomes may be OK, if they are accurate). I also checked a couple of those genotypes: they were the same between the two STITCH runs, but they tended to be either 0/0 or 1/1.

I think part of the reason I am able to do this is that I set niterations = 20 instead of niterations = 1 (which gave me an error message for positions that don't strictly overlap).

I'll post another comment with the results, after modifying the code to run on all full chromosomes. I think the run-time is quite reasonable, so maybe I can even have this posted today.

However, in the meantime, do you think STITCH should be giving an error message, even though it looks like it is running successfully?
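For the kind of spot-checking of genotypes between the two runs described above, the GT calls can be pulled out of each single-sample VCF and joined on position. A minimal sketch, assuming uncompressed VCFs with GT as the first FORMAT field and with hypothetical filenames:

```shell
# Extract CHROM:POS plus the GT call (first colon-separated value of the
# single sample column, field 10) from each VCF.
extract_gt() {
  awk '!/^#/ {split($10, g, ":"); print $1 ":" $2 "\t" g[1]}' "$1"
}
extract_gt run_A.vcf > A.gt
extract_gt run_B.vcf > B.gt

# Report sites present in both runs, with the two calls side by side.
awk 'NR == FNR {gt[$1] = $2; next}
     $1 in gt {print $1 "\t" gt[$1] "\t" $2}' A.gt B.gt > shared_genotypes.txt
```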

cwarden45 commented 4 years ago

Even though STITCH ran for the demo range, I get an error if I try to use a similar strategy for the full chromosome 1:

Code:

bam_list = "Nebula_full_nodup-bams.txt"
name_list = "Nebula_full_nodup-names.txt"
output_prefix = "1KG_IMPUTE2-Nebula_full_nodup"

library("STITCH")
pos_folder = "human_g1k_v37-pos_files"

autosomal_chr = 1:22
fa_index =  "../1000_Genomes_BAMs/Ref/human_g1k_v37.fasta.fai"
fa_index.table = read.table(fa_index, head=F, sep="\t")

#human_genfile = "gen_sequencing.txt" #this is supposed to be optional
human_K = 10
human_nGen = 4 * 20000 / human_K
temp_folder = tempdir()

#try to follow human example 3: https://github.com/rwdavies/STITCH/blob/master/examples/examples.R

for (chr in autosomal_chr){
    print(paste("Running STITCH for ",chr,"...",sep=""))
    human_posfile = paste(pos_folder,"/",chr,"_pos.txt",sep="")
    output_folder = paste(output_prefix,"_",chr,sep="")

    human_reference_sample_file = "../1000_Genomes_BAMs/IMPUTE2_Files/1000GP_Phase3/1000GP_Phase3.sample"
    human_reference_legend_file = paste("../1000_Genomes_BAMs/IMPUTE2_Files/1000GP_Phase3/1000GP_Phase3_chr",chr,".legend.gz",sep="")
    human_reference_haplotype_file = paste("../1000_Genomes_BAMs/IMPUTE2_Files/1000GP_Phase3/1000GP_Phase3_chr",chr,".hap.gz",sep="")

    STITCH(
        bamlist = bam_list,
        sampleNames_file = name_list,
        outputdir = output_folder,
        method = "diploid",
        regenerateInput = TRUE,
        regionStart = 1,
        regionEnd = fa_index.table$V2[fa_index.table$V1 == chr],
        buffer = 10000,
        niterations = 20,
        chr = chr,
        inputBundleBlockSize = 100,
        reference_populations = c("CEU", "GBR", "ACB"),
        reference_haplotype_file = human_reference_haplotype_file,
        reference_sample_file = human_reference_sample_file,
        reference_legend_file = human_reference_legend_file,
        posfile = human_posfile,
        shuffleHaplotypeIterations = NA,
        refillIterations = NA,
        K = human_K, tempdir = temp_folder, nCores = 1, nGen = human_nGen)

      stop() #only run chromosome 1 while testing
}#end for (chr in autosomal_chr)

Error:

Loading required package: parallel
Loading required package: rrbgen
[1] "Running STITCH for 1..."
[2020-03-18 14:31:41] Running STITCH(chr = 1, nGen = 8000, posfile = human_g1k_v37-pos_files/1_pos.txt, K = 10, S = 1, outputdir = 1KG_IMPUTE2-Nebula_full_nodup_1, nStarts = , tempdir = /tmp/RtmpcIcJVm, bamlist = Nebula_full_nodup-bams.txt, cramlist = , sampleNames_file = Nebula_full_nodup-names.txt, reference = , genfile = , method = diploid, output_format = bgvcf, B_bit_prob = 16, outputInputInVCFFormat = FALSE, downsampleToCov = 50, downsampleFraction = 1, readAware = TRUE, chrStart = NA, chrEnd = NA, regionStart = 1, regionEnd = 249250621, buffer = 10000, maxDifferenceBetweenReads = 1000, maxEmissionMatrixDifference = 1e+10, alphaMatThreshold = 1e-04, emissionThreshold = 1e-04, iSizeUpperLimit = 600, bqFilter = 17, niterations = 20, shuffleHaplotypeIterations = NA, splitReadIterations = 25, nCores = 1, expRate = 0.5, maxRate = 100, minRate = 0.1, Jmax = 1000, regenerateInput = TRUE, originalRegionName = NA, keepInterimFiles = FALSE, keepTempDir = FALSE, outputHaplotypeProbabilities = FALSE, switchModelIteration = NA, generateInputOnly = FALSE, restartIterations = NA, refillIterations = NA, downsampleSamples = 1, downsampleSamplesKeepList = NA, subsetSNPsfile = NA, useSoftClippedBases = FALSE, outputBlockSize = 1000, outputSNPBlockSize = 10000, inputBundleBlockSize = 100, genetic_map_file = , reference_haplotype_file = ../1000_Genomes_BAMs/IMPUTE2_Files/1000GP_Phase3/1000GP_Phase3_chr1.hap.gz, reference_legend_file = ../1000_Genomes_BAMs/IMPUTE2_Files/1000GP_Phase3/1000GP_Phase3_chr1.legend.gz, reference_sample_file = ../1000_Genomes_BAMs/IMPUTE2_Files/1000GP_Phase3/1000GP_Phase3.sample, reference_populations = c("CEU", "GBR", "ACB"), reference_phred = 20, reference_iterations = 40, reference_shuffleHaplotypeIterations = c(4, 8, 12, 16), output_filename = NULL, initial_min_hapProb = 0.2, initial_max_hapProb = 0.8, regenerateInputWithDefaultValues = FALSE, plotHapSumDuringIterations = FALSE, plot_shuffle_haplotype_attempts = FALSE, plotAfterImputation = TRUE, 
save_sampleReadsInfo = FALSE, gridWindowSize = NA, shuffle_bin_nSNPs = NULL, shuffle_bin_radius = 5000, keepSampleReadsInRAM = FALSE, useTempdirWhileWriting = FALSE, output_haplotype_dosages = FALSE)
[2020-03-18 14:31:41] Program start
[2020-03-18 14:31:41] Get and validate pos and gen
[2020-03-18 14:31:41] Done get and validate pos and gen
[2020-03-18 14:31:41] There are 0 variants in the left buffer region -9999 <= position < 1
[2020-03-18 14:31:41] There are 20133 variants in the central region 1 <= position <= 249250621
[2020-03-18 14:31:41] There are 0 variants in the right buffer region 249250621 < position <= 249260621
[2020-03-18 14:31:41] Generate inputs
[2020-03-18 14:31:42] Done generating inputs
[2020-03-18 14:31:42] Copying files onto tempdir
[2020-03-18 14:31:42] Done copying files onto tempdir
[2020-03-18 14:31:42] Generate allele count
[2020-03-18 14:31:42] Quantiles across SNPs of per-sample depth of coverage
[2020-03-18 14:31:42]     5%   25%   50%   75%   95%
[2020-03-18 14:31:42]  0.000 0.000 0.000 1.000 2.000
[2020-03-18 14:31:42] Done generating allele count
[2020-03-18 14:31:42] Outputting will be done in 3 blocks with on average 6711 SNPs in them
[2020-03-18 14:31:42] Begin parameter initialization
[2020-03-18 14:31:42] Begin initializing paramters using reference haplotypes
[2020-03-18 14:31:42] Begin get haplotypes from reference
[2020-03-18 14:31:42] Load and validate reference legend header
[2020-03-18 14:31:42] Load and validate reference legend
[2020-03-18 14:31:42] Keeping 286 out of 2504 reference samples from populations:CEU,GBR,ACB
[2020-03-18 14:34:56] Extract reference haplotypes
[2020-03-18 14:35:00] In the region to be imputed plus buffer there are the following number of SNPs:
[2020-03-18 14:35:00] 20133 from the posfile
[2020-03-18 14:35:00] 6500358 from the reference haplotypes
[2020-03-18 14:35:00] 20088 in the intersection of the posfile and the reference haplotypes
[2020-03-18 14:46:29] Succesfully extracted 572 haplotypes from reference data
[2020-03-18 14:46:29] End get haplotypes from reference
[2020-03-18 14:46:29] The following correlations are observed between allele frequencies estimated from the sequencing pileup and the reference haplotype counts:
[2020-03-18 14:46:29] 0.075 for all SNPs
[2020-03-18 14:46:29] -0.003 for > 5% MAF SNPs
[2020-03-18 14:46:29] 0.08 for < 5% MAF SNPs
[2020-03-18 14:46:29] A plot of allele frequencies from sequencing pileup vs reference haplotype counts is at:1KG_IMPUTE2-Nebula_full_nodup_1/plots/alleleFrequency_pileup_vs_reference_haplotypes.1.1.249250621.png
[2020-03-18 14:46:30] Begin determining haplotypes to use for initialization
Error in a[a > 0.5] <- 1 - a[a > 0.5] :
  NAs are not allowed in subscripted assignments
Calls: STITCH ... get_and_initialize_from_reference -> sample_haps_to_use
Execution halted

Is this something where you can help me troubleshoot?

If I look at the demo range that finished running, I guess this is possibly less than half-way done. However, if I can change the parameters (such as the regions to test) and get a result, maybe I am somewhat close to a successful run?

cwarden45 commented 4 years ago

I tested the following changes, but those unfortunately don't solve the problem:

1) Set buffer = 1000000, instead of buffer = 10000
2) Don't set inputBundleBlockSize = 100, which I believe sets that value to NA

I will see what else I can think of, but I would still appreciate any help (if possible).

cwarden45 commented 4 years ago

I have modified the code to try and avoid the telomeres and centromeres, but that doesn't appear to solve the problem:

bam_list = "Nebula_full_nodup-bams.txt"
name_list = "Nebula_full_nodup-names.txt"
output_prefix = "1KG_IMPUTE2-Nebula_full_nodup"
shift  = 40000000#I think this can be smaller, but I am trying to get something other than the demo window to work
buffer =  1000000#looking at the output, I think the buffer is more like the shift rather than the window size

library("STITCH")
pos_folder = "human_g1k_v37-pos_files"

autosomal_chr = 1:22
fa_index =  "../1000_Genomes_BAMs/Ref/human_g1k_v37.fasta.fai"
fa_index.table = read.table(fa_index, head=F, sep="\t")

#created following instructions from https://www.biostars.org/p/2349/
centromere_table = read.table("UCSC_Download_Centromere.txt", head=F, sep="\t")
centromere_table$V1 = as.character(centromere_table$V1)
centromere_table$V1 = gsub("chr","",centromere_table$V1)

#human_genfile = "gen_sequencing.txt" #this is supposed to be optional
human_K = 10
human_nGen = 4 * 20000 / human_K
temp_folder = tempdir()

#try to follow human example 3: https://github.com/rwdavies/STITCH/blob/master/examples/examples.R

for (chr in autosomal_chr){
    print(paste("Finding centromere for chromosome ",chr,"...",sep=""))
    temp.centromere.table = centromere_table[centromere_table$V1 == chr,]
    pos.values = c(temp.centromere.table$V2,temp.centromere.table$V3)
    centromere.start = min(pos.values)
    centromere.stop = max(pos.values)
    print(paste("Centromere length: ",centromere.stop-centromere.start," bp",sep=""))

    chr_length = fa_index.table$V2[fa_index.table$V1 == chr]

    ######################################
    ### sequence "left" of centromere ###
    ######################################
    STITCH_start = 1+shift
    STITCH_end = centromere.start-shift
    if((STITCH_end - STITCH_start) > 20 * buffer){
        print(paste("Running STITCH for left-side of ",chr,"...",sep=""))
        human_posfile = paste(pos_folder,"/",chr,"_pos.txt",sep="")
        output_folder = paste(output_prefix,"_",chr,"_left",sep="")

        human_reference_sample_file = "../1000_Genomes_BAMs/IMPUTE2_Files/1000GP_Phase3/1000GP_Phase3.sample"
        human_reference_legend_file = paste("../1000_Genomes_BAMs/IMPUTE2_Files/1000GP_Phase3/1000GP_Phase3_chr",chr,".legend.gz",sep="")
        human_reference_haplotype_file = paste("../1000_Genomes_BAMs/IMPUTE2_Files/1000GP_Phase3/1000GP_Phase3_chr",chr,".hap.gz",sep="")

        STITCH(
            bamlist = bam_list,
            sampleNames_file = name_list,
            outputdir = output_folder,
            method = "diploid",
            regenerateInput = TRUE,
            regionStart = STITCH_start,
            regionEnd = STITCH_end,
            buffer = buffer,
            niterations = 20,
            chr = chr,
    #       reference_populations = c("CEU", "GBR", "ACB"),#this is a reference set of 286 samples
            reference_populations = c("CEU"),#this is a reference set of 99 samples     
            reference_haplotype_file = human_reference_haplotype_file,
            reference_sample_file = human_reference_sample_file,
            reference_legend_file = human_reference_legend_file,
            posfile = human_posfile,
            shuffleHaplotypeIterations = NA,
            refillIterations = NA,
            K = human_K, tempdir = temp_folder, nCores = 1, nGen = human_nGen)
        }#end if((STITCH_end - STITCH_start) > 20 * buffer)

    ######################################
    ### sequence "right" of centromere ###
    ######################################
    STITCH_start = centromere.stop+shift
    STITCH_end = chr_length-shift

    if((STITCH_end - STITCH_start) > 20 * buffer){
        print("Add Code")
        stop()
    }#end if((STITCH_end - STITCH_start) > 20 * buffer) 
}#end for (chr in autosomal_chr)

I am still getting an error from get_and_initialize_from_reference -> sample_haps_to_use. Hopefully, I can think of something different tomorrow (or at least better understand what is causing the error).

rwdavies commented 4 years ago

6500358

This definitely seems like a bug. I'm looking at the code and test files and trying to understand what went wrong

rwdavies commented 4 years ago

For chromosome 1, why are you only imputing 20133 variants instead of the entire reference set of 6500358? (unrelated to the bug)

rwdavies commented 4 years ago

For the niterations=1 vs niterations=20 issue of overlap: you can build an input posfile from the provided legend using something like the code below. (STITCH only uses the reference panel on the first iteration, as a form of initialization, so if you are only imputing yourself with >1 iteration, the imputation will not really work, since after the first iteration the updating relies only on your data, not on the external data.)

library(data.table) ## for fread
pos2 <- fread(cmd = paste0("gunzip -c ", reference_legend_file), data.table = FALSE)
pos <- cbind(1, pos2[, 2], pos2[, 3], pos2[, 4])
write.table(
    pos,
    file = new_posfile, ## some new file you specify
    sep = "\t",
    quote = FALSE,
    row.names = FALSE,
    col.names = FALSE
)
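An equivalent command-line sketch, with hypothetical filenames; like the R snippet it assumes the legend has a header line with columns 2-4 being position, a0, a1, and it hardcodes the chromosome to 1:

```shell
# Build a STITCH posfile (chr, pos, ref, alt; tab-separated, no header)
# from an IMPUTE2-style legend file.
gunzip -c 1000GP_Phase3_chr1.legend.gz \
  | awk 'NR > 1 {print 1 "\t" $2 "\t" $3 "\t" $4}' \
  > 1_pos.txt
```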
cwarden45 commented 4 years ago

Thank you very much for the feedback.

The smaller number relates to the sites that I have for my 23andMe, Genes for Good, and 1000 Genomes Omni SNP chip data. However, I can use more sites for the imputation, and then use a subset of them to compare to my other genomic data.

I will work on creating this alternative position file (and changing the number of iterations back to 1), and I will let you know how that goes.

I uploaded the smaller position file to GitHub. I may not be able to share the larger one this way. However, I will have the code (based upon your suggestion) posted.

Thank you again.

rwdavies commented 4 years ago

Can you share with me the posfile used that gave you this for chromosome 1 (is this already on your github repo?)

[2020-03-18 14:31:41] There are 0 variants in the left buffer region -9999 <= position < 1
[2020-03-18 14:31:41] There are 20133 variants in the central region 1 <= position <= 249250621
[2020-03-18 14:31:41] There are 0 variants in the right buffer region 249250621 < position <= 249260621

I'm having trouble reproducing the bug. I loaded the reference data from the IMPUTE2 link, and it, on its own, doesn't allow me to reproduce the bug, and the code is pretty well tested and relatively simple (where the bug is occurring), so I think something unusually weird is going on in STITCH.

cwarden45 commented 4 years ago

That portion of the error message is different when I exclude more of the telomeres or centromeres.

However, if it helps, that smaller position file is in human_g1k_v37-pos_files.zip, which can be downloaded here:

https://github.com/cwarden45/DTC_Scripts/tree/master/Nebula/Gencove/STITCH

I'll also let you know my progress with the larger (new) position files that I create, and without running more than 1 iteration (as soon as possible).

cwarden45 commented 4 years ago

With a modified version of the code that you provided (to remove sites with more than one variant at the same position, as well as sites with indels), I think I can create valid position files from the IMPUTE2 files. However, if I set niterations=1, then I think I also have to create filtered versions of the IMPUTE2 files.

I will work on this. However, if others have a similar problem and want to see the (future) solution, here is what I am currently working on:

Code:

bam_list = "Nebula_full_nodup-bams.txt"
name_list = "Nebula_full_nodup-names.txt"
output_prefix = "1KG_IMPUTE2-Nebula_full_nodup"
shift  = 10000000#I think this can be smaller, but I am trying to get something other than the demo window to work
buffer =  1000000#looking at the output, I think the buffer is more like the shift rather than the window size

library("STITCH")
#pos_folder = "human_g1k_v37-pos_files"
pos_folder = "../1000_Genomes_BAMs/IMPUTE2_Files/1000GP_Phase3"#change to this to match reference set (with more SNPs)

autosomal_chr = 1:22
fa_index =  "../1000_Genomes_BAMs/Ref/human_g1k_v37.fasta.fai"
fa_index.table = read.table(fa_index, head=F, sep="\t")

#created following instructions from https://www.biostars.org/p/2349/
centromere_table = read.table("UCSC_Download_Centromere.txt", head=F, sep="\t")
centromere_table$V1 = as.character(centromere_table$V1)
centromere_table$V1 = gsub("chr","",centromere_table$V1)

#human_genfile = "gen_sequencing.txt" #this is supposed to be optional
human_K = 10
human_nGen = 4 * 20000 / human_K
temp_folder = tempdir()

#try to follow human example 3: https://github.com/rwdavies/STITCH/blob/master/examples/examples.R

for (chr in autosomal_chr){
    print(paste("Finding centromere for chromosome ",chr,"...",sep=""))
    temp.centromere.table = centromere_table[centromere_table$V1 == chr,]
    pos.values = c(temp.centromere.table$V2,temp.centromere.table$V3)
    centromere.start = min(pos.values)
    centromere.stop = max(pos.values)
    print(paste("Centromere length: ",centromere.stop-centromere.start," bp",sep=""))

    chr_length = fa_index.table$V2[fa_index.table$V1 == chr]

    human_reference_sample_file = "../1000_Genomes_BAMs/IMPUTE2_Files/1000GP_Phase3/1000GP_Phase3.sample"
    human_reference_legend_file = paste("../1000_Genomes_BAMs/IMPUTE2_Files/1000GP_Phase3/1000GP_Phase3_chr",chr,".legend.gz",sep="")
    human_reference_haplotype_file = paste("../1000_Genomes_BAMs/IMPUTE2_Files/1000GP_Phase3/1000GP_Phase3_chr",chr,".hap.gz",sep="")

    human_posfile = paste(pos_folder,"/",chr,"_pos.txt",sep="")

    if(!(file.exists(human_posfile))){
        print(paste("Generating full position file for chromosome ",chr,"...",sep=""))
        #use code modified from STITCH troubleshooting suggestion: https://github.com/rwdavies/STITCH/issues/29#issuecomment-601192607
        library("data.table")
        pos2 = fread(cmd = paste0("gunzip -c ", human_reference_legend_file), data.table = FALSE)
        pos = cbind(chr, pos2[, 2], pos2[, 3], pos2[, 4])
        pos.counts =  table(pos2[, 2])
        print(table(pos.counts))
        pos.counts = pos.counts[pos.counts == 1]
        #file needs to be sorted for STITCH (with unique positions)
        print(dim(pos))
        #remove sites with multiple variants at the same position
        pos = pos[match(names(pos.counts), pos2[, 2]),]
        print(dim(pos))
        #only include SNPs
        refNuc = pos[, 3]
        names(refNuc) = pos[, 2]
        refNucLen = sapply(refNuc, nchar)
        refNucLen = refNucLen[match(as.character(pos[, 2]), names(refNucLen))]
        refNucLen[is.na(refNucLen)]=0
        refNucLen =  as.numeric(refNucLen)

        varNuc = pos[, 4]
        names(varNuc) = pos[, 2]
        varNucLen = sapply(varNuc, nchar)
        varNucLen = varNucLen[match(as.character(pos[, 2]), names(varNucLen))]
        varNucLen[is.na(varNucLen)]=0
        varNucLen =  as.numeric(varNucLen)      

        pos = pos[(refNucLen == 1)&(varNucLen == 1),]
        print(dim(pos))
        write.table(
            pos,
            file = human_posfile, ## some new file you specify
            sep = "\t",
            quote = FALSE,
            row.names = FALSE,
            col.names = FALSE)
    }#end if(!(file.exists(human_posfile)))

    ######################################
    ### sequence "left" of centromere ###
    ######################################
    STITCH_start = 1+shift
    STITCH_end = centromere.start-shift
    if((STITCH_end - STITCH_start) > 20 * buffer){
        print(paste("Running STITCH for left-side of ",chr,"...",sep=""))
        output_folder = paste(output_prefix,"_",chr,"_left",sep="")

        STITCH(
            bamlist = bam_list,
            sampleNames_file = name_list,
            outputdir = output_folder,
            method = "diploid",
            regenerateInput = TRUE,
            regionStart = STITCH_start,
            regionEnd = STITCH_end,
            buffer = buffer,
            niterations = 1,
            chr = chr,
    #       reference_populations = c("CEU", "GBR", "ACB"),#this is a reference set of 286 samples
            reference_populations = c("CEU"),#this is a reference set of 99 samples     
            reference_haplotype_file = human_reference_haplotype_file,
            reference_sample_file = human_reference_sample_file,
            reference_legend_file = human_reference_legend_file,
            posfile = human_posfile,
            shuffleHaplotypeIterations = NA,
            refillIterations = NA,
            K = human_K, tempdir = temp_folder, nCores = 1, nGen = human_nGen)
        }#end if((STITCH_end - STITCH_start) > 20 * buffer)

    ######################################
    ### sequence "right" of centromere ###
    ######################################
    STITCH_start = centromere.stop+shift
    STITCH_end = chr_length-shift

    if((STITCH_end - STITCH_start) > 20 * buffer){
        print("Add Code")
        stop()
    }#end if((STITCH_end - STITCH_start) > 20 * buffer) 
}#end for (chr in autosomal_chr)

Error:

Loading required package: parallel
Loading required package: rrbgen
[1] "Finding centromere for chromosome 1..."
[1] "Centromere length: 7400000 bp"
[1] "Generating full position file for chromosome 1..."
pos.counts
      1       2       3       4       5       6       7       8      12
6425256   34724    1332     277      81      16       3       2       1
[1] 6500358       4
[1] 6425256       4
[1] 6190898       4
[1] "Running STITCH for left-side of 1..."
[2020-03-19 11:50:24] Running STITCH(chr = 1, nGen = 8000, posfile = ../1000_Genomes_BAMs/IMPUTE2_Files/1000GP_Phase3/1_pos.txt, K = 10, S = 1, outputdir = 1KG_IMPUTE2-Nebula_full_nodup_1_left, nStarts = , tempdir = /tmp/Rtmpr0cLic, bamlist = Nebula_full_nodup-bams.txt, cramlist = , sampleNames_file = Nebula_full_nodup-names.txt, reference = , genfile = , method = diploid, output_format = bgvcf, B_bit_prob = 16, outputInputInVCFFormat = FALSE, downsampleToCov = 50, downsampleFraction = 1, readAware = TRUE, chrStart = NA, chrEnd = NA, regionStart = 10000001, regionEnd = 111500000, buffer = 1e+06, maxDifferenceBetweenReads = 1000, maxEmissionMatrixDifference = 1e+10, alphaMatThreshold = 1e-04, emissionThreshold = 1e-04, iSizeUpperLimit = 600, bqFilter = 17, niterations = 1, shuffleHaplotypeIterations = NA, splitReadIterations = 25, nCores = 1, expRate = 0.5, maxRate = 100, minRate = 0.1, Jmax = 1000, regenerateInput = TRUE, originalRegionName = NA, keepInterimFiles = FALSE, keepTempDir = FALSE, outputHaplotypeProbabilities = FALSE, switchModelIteration = NA, generateInputOnly = FALSE, restartIterations = NA, refillIterations = NA, downsampleSamples = 1, downsampleSamplesKeepList = NA, subsetSNPsfile = NA, useSoftClippedBases = FALSE, outputBlockSize = 1000, outputSNPBlockSize = 10000, inputBundleBlockSize = NA, genetic_map_file = , reference_haplotype_file = ../1000_Genomes_BAMs/IMPUTE2_Files/1000GP_Phase3/1000GP_Phase3_chr1.hap.gz, reference_legend_file = ../1000_Genomes_BAMs/IMPUTE2_Files/1000GP_Phase3/1000GP_Phase3_chr1.legend.gz, reference_sample_file = ../1000_Genomes_BAMs/IMPUTE2_Files/1000GP_Phase3/1000GP_Phase3.sample, reference_populations = CEU, reference_phred = 20, reference_iterations = 40, reference_shuffleHaplotypeIterations = c(4, 8, 12, 16), output_filename = NULL, initial_min_hapProb = 0.2, initial_max_hapProb = 0.8, regenerateInputWithDefaultValues = FALSE, plotHapSumDuringIterations = FALSE, plot_shuffle_haplotype_attempts = FALSE, 
plotAfterImputation = TRUE, save_sampleReadsInfo = FALSE, gridWindowSize = NA, shuffle_bin_nSNPs = NULL, shuffle_bin_radius = 5000, keepSampleReadsInRAM = FALSE, useTempdirWhileWriting = FALSE, output_haplotype_dosages = FALSE)
[2020-03-19 11:50:24] Program start
[2020-03-19 11:50:24] Get and validate pos and gen
[2020-03-19 11:50:37] Done get and validate pos and gen
[2020-03-19 11:50:38] There are 30312 variants in the left buffer region 9000001 <= position < 10000001
[2020-03-19 11:50:38] There are 2757748 variants in the central region 10000001 <= position <= 111500000
[2020-03-19 11:50:39] There are 27925 variants in the right buffer region 111500000 < position <= 112500000
[2020-03-19 11:50:39] Generate inputs
[2020-03-19 11:50:44] Done generating inputs
[2020-03-19 11:50:44] Copying files onto tempdir
[2020-03-19 11:50:44] Done copying files onto tempdir
[2020-03-19 11:50:44] Generate allele count
[2020-03-19 11:50:45] Quantiles across SNPs of per-sample depth of coverage
[2020-03-19 11:50:45]     5%   25%   50%   75%   95%
[2020-03-19 11:50:45]  0.000 0.000 0.000 0.000 2.000
[2020-03-19 11:50:45] Done generating allele count
[2020-03-19 11:50:47] Outputting will be done in 276 blocks with on average 9991.8 SNPs in them
[2020-03-19 11:50:47] Begin parameter initialization
[2020-03-19 11:50:47] Begin initializing paramters using reference haplotypes
[2020-03-19 11:50:47] Begin get haplotypes from reference
[2020-03-19 11:50:47] Load and validate reference legend header
[2020-03-19 11:50:47] Load and validate reference legend
[2020-03-19 11:50:47] Keeping 99 out of 2504 reference samples from populations:CEU
[2020-03-19 11:51:55] Extract reference haplotypes
Error in validate_pos_and_legend_snps_for_niterations_equals_1(legend_snps,  :
  You have selected to use reference haplotypes with niterations=1, which requires exact matching of reference legend SNPs and posfile SNPs. However, reference legend SNP with pos-ref-alt 9000146-AC-A was not found in posfile
Calls: STITCH ... validate_pos_and_legend_snps_for_niterations_equals_1
Execution halted
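Since niterations=1 requires the posfile and the reference legend to match exactly, the legend and hap files have to be filtered in lockstep when dropping indels and multi-allelic positions. One possible command-line sketch, with hypothetical filenames; the legend is assumed to have a header line with columns id position a0 a1, and the .hap file one row (no header) per legend variant:

```shell
# Decompress once to temporary plain-text copies.
gunzip -c chr1.legend.gz > legend.txt
gunzip -c chr1.hap.gz    > hap.txt

# Row numbers (1-based, header excluded) of biallelic SNPs at unique positions:
# pass 1 counts variants per position, pass 2 keeps single-base ref/alt rows.
awk 'pass == 1 && FNR > 1 {n[$2]++}
     pass == 2 && FNR > 1 && $3 ~ /^[ACGT]$/ && $4 ~ /^[ACGT]$/ && n[$2] == 1 {print FNR - 1}' \
    pass=1 legend.txt pass=2 legend.txt > keep.txt

# Subset legend (keeping its header) and hap to those rows, in lockstep.
awk 'NR == FNR {keep[$1]; next} FNR == 1 || (FNR - 1) in keep' keep.txt legend.txt | gzip > chr1.filtered.legend.gz
awk 'NR == FNR {keep[$1]; next} FNR in keep' keep.txt hap.txt | gzip > chr1.filtered.hap.gz
```

The filtered pair can then be passed as reference_legend_file and reference_haplotype_file, with the posfile built from the same filtered legend.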
rwdavies commented 4 years ago

Yeah, right, I forgot that you need to sanitize the IMPUTE2 input reference files, as they contain indels etc.
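For reference, that sanitizing step could look something like this minimal Python sketch (this is not STITCH's own `filter_IMPUTE2_files.R`; it assumes the standard IMPUTE2 legend layout of `id position a0 a1` with a header row, and that the `.hap` file has one row per legend SNP):

```python
import gzip

def is_biallelic_snp(legend_row):
    """True if both alleles in an IMPUTE2 legend row are single bases (a SNP)."""
    fields = legend_row.split()
    a0, a1 = fields[2], fields[3]
    return len(a0) == 1 and len(a1) == 1 and a0 in "ACGT" and a1 in "ACGT"

def filter_legend(in_path, out_path):
    """Copy a .legend.gz, dropping indels and other non-SNPs.

    Returns the kept 0-based row indices, so the corresponding rows of the
    matching .hap file (one row per SNP) can be filtered identically.
    """
    kept = []
    with gzip.open(in_path, "rt") as fin, gzip.open(out_path, "wt") as fout:
        fout.write(fin.readline())  # header
        for i, line in enumerate(fin):
            if is_biallelic_snp(line):
                fout.write(line)
                kept.append(i)
    return kept
```

The key point is that the legend and hap files must stay row-synchronized, so the same indices have to be dropped from both.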

I'm weirdly still having trouble replicating this bug:

[2020-03-18 14:46:30] Begin determining haplotypes to use for initialization
Error in a[a > 0.5] <- 1 - a[a > 0.5] :
  NAs are not allowed in subscripted assignments
Calls: STITCH ... get_and_initialize_from_reference -> sample_haps_to_use
Execution halted

What version of STITCH are you using? Is it from GitHub (what commit) or a release version?

cwarden45 commented 4 years ago

I believe that I am using STITCH v1.6.2 (with R v3.6).

cwarden45 commented 4 years ago

I have some code that runs the 1st region on chromosome 1 without generating any error messages. I needed to make some additional changes:

1) Create an alternate set of reference files using filter_IMPUTE2_files.R (with the dependency extract_desired_rows.pl)
2) Run STITCH using run_STITCH.R
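Since `niterations=1` requires the reference legend SNPs and posfile SNPs to match exactly, one way to guarantee that is to generate the posfile directly from the filtered legend. A minimal sketch, assuming STITCH's four-column posfile layout (chromosome, position, ref, alt; tab-separated, no header) and the standard IMPUTE2 legend columns:

```python
import gzip

def legend_to_posfile_line(chrom, legend_row):
    """Convert one IMPUTE2 legend row (id position a0 a1 ...) into a
    STITCH posfile line (chrom, pos, ref, alt; tab-separated)."""
    _rsid, pos, a0, a1 = legend_row.split()[:4]
    return f"{chrom}\t{pos}\t{a0}\t{a1}"

# Illustrative usage (paths are hypothetical):
# with gzip.open("1_legend.filtered.txt.gz", "rt") as fin, \
#         open("1_pos.txt", "w") as fout:
#     next(fin)  # skip legend header
#     for row in fin:
#         fout.write(legend_to_posfile_line("1", row) + "\n")
```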

This is what I think is the successful output:

Loading required package: parallel
Loading required package: rrbgen
[1] "Finding centromere for chromosome 1..."
[1] "Centromere length: 7400000 bp"
[1] "Running STITCH for left-side of 1..."
[2020-03-20 10:21:55] Running STITCH(chr = 1, nGen = 8000, posfile = ../1000_Genomes_BAMs/IMPUTE2_Files/1000GP_Phase3/1_pos.txt, K = 10, S = 1, outputdir = 1KG_IMPUTE2-Nebula_full_nodup_1_left, nStarts = , tempdir = /tmp/RtmpfESj1e, bamlist = Nebula_full_nodup-bams.txt, cramlist = , sampleNames_file = Nebula_full_nodup-names.txt, reference = , genfile = , method = diploid, output_format = bgvcf, B_bit_prob = 16, outputInputInVCFFormat = FALSE, downsampleToCov = 50, downsampleFraction = 1, readAware = TRUE, chrStart = NA, chrEnd = NA, regionStart = 10000001, regionEnd = 111500000, buffer = 1e+06, maxDifferenceBetweenReads = 1000, maxEmissionMatrixDifference = 1e+10, alphaMatThreshold = 1e-04, emissionThreshold = 1e-04, iSizeUpperLimit = 600, bqFilter = 17, niterations = 1, shuffleHaplotypeIterations = NA, splitReadIterations = 25, nCores = 1, expRate = 0.5, maxRate = 100, minRate = 0.1, Jmax = 1000, regenerateInput = TRUE, originalRegionName = NA, keepInterimFiles = FALSE, keepTempDir = FALSE, outputHaplotypeProbabilities = FALSE, switchModelIteration = NA, generateInputOnly = FALSE, restartIterations = NA, refillIterations = NA, downsampleSamples = 1, downsampleSamplesKeepList = NA, subsetSNPsfile = NA, useSoftClippedBases = FALSE, outputBlockSize = 1000, outputSNPBlockSize = 10000, inputBundleBlockSize = NA, genetic_map_file = , reference_haplotype_file = ../1000_Genomes_BAMs/IMPUTE2_Files/1000GP_Phase3/1_hap.txt.gz, reference_legend_file = ../1000_Genomes_BAMs/IMPUTE2_Files/1000GP_Phase3/1_legend.txt.gz, reference_sample_file = ../1000_Genomes_BAMs/IMPUTE2_Files/1000GP_Phase3/1000GP_Phase3.sample, reference_populations = CEU, reference_phred = 20, reference_iterations = 40, reference_shuffleHaplotypeIterations = c(4, 8, 12, 16), output_filename = NULL, initial_min_hapProb = 0.2, initial_max_hapProb = 0.8, regenerateInputWithDefaultValues = FALSE, plotHapSumDuringIterations = FALSE, plot_shuffle_haplotype_attempts = FALSE, plotAfterImputation = TRUE, 
save_sampleReadsInfo = FALSE, gridWindowSize = NA, shuffle_bin_nSNPs = NULL, shuffle_bin_radius = 5000, keepSampleReadsInRAM = FALSE, useTempdirWhileWriting = FALSE, output_haplotype_dosages = FALSE)
[2020-03-20 10:21:55] Program start
[2020-03-20 10:21:55] Get and validate pos and gen
[2020-03-20 10:22:16] Done get and validate pos and gen
[2020-03-20 10:22:17] There are 30312 variants in the left buffer region 9000001 <= position < 10000001
[2020-03-20 10:22:17] There are 2757748 variants in the central region 10000001 <= position <= 111500000
[2020-03-20 10:22:17] There are 27925 variants in the right buffer region 111500000 < position <= 112500000
[2020-03-20 10:22:17] Generate inputs
[2020-03-20 10:22:21] Done generating inputs
[2020-03-20 10:22:21] Copying files onto tempdir
[2020-03-20 10:22:21] Done copying files onto tempdir
[2020-03-20 10:22:21] Generate allele count
[2020-03-20 10:22:23] Quantiles across SNPs of per-sample depth of coverage
[2020-03-20 10:22:23]     5%   25%   50%   75%   95%
[2020-03-20 10:22:23]  0.000 0.000 0.000 0.000 2.000
[2020-03-20 10:22:23] Done generating allele count
[2020-03-20 10:22:24] Outputting will be done in 276 blocks with on average 9991.8 SNPs in them
[2020-03-20 10:22:24] Begin parameter initialization
[2020-03-20 10:22:25] Begin initializing paramters using reference haplotypes
[2020-03-20 10:22:25] Begin get haplotypes from reference
[2020-03-20 10:22:25] Load and validate reference legend header
[2020-03-20 10:22:25] Load and validate reference legend
[2020-03-20 10:22:25] Keeping 99 out of 2504 reference samples from populations:CEU
[2020-03-20 10:23:37] Extract reference haplotypes
[2020-03-20 10:23:42] In the region to be imputed plus buffer there are the following number of SNPs:
[2020-03-20 10:23:42] 2815985 from the posfile
[2020-03-20 10:23:42] 2815985 from the reference haplotypes
[2020-03-20 10:23:42] 2815985 in the intersection of the posfile and the reference haplotypes
[2020-03-20 10:28:50] Succesfully extracted 198 haplotypes from reference data
[2020-03-20 10:28:50] End get haplotypes from reference
[2020-03-20 10:28:50] The following correlations are observed between allele frequencies estimated from the sequencing pileup and the reference haplotype counts:
[2020-03-20 10:28:50] 0.364 for all SNPs
[2020-03-20 10:28:50] 0.02 for > 5% MAF SNPs
[2020-03-20 10:28:50] 0.36 for < 5% MAF SNPs
[2020-03-20 10:28:50] A plot of allele frequencies from sequencing pileup vs reference haplotype counts is at:1KG_IMPUTE2-Nebula_full_nodup_1_left/plots/alleleFrequency_pileup_vs_reference_haplotypes.1.10000001.111500000.png
[2020-03-20 10:29:16] Begin determining haplotypes to use for initialization
[2020-03-20 10:29:24] Perform K-means
[2020-03-20 10:29:24] Done performing K-means
[2020-03-20 10:29:24] Done determining haplotypes to use for initialization
[2020-03-20 10:29:25] Running reference EM with 198 reference haplotypes
[2020-03-20 10:29:25] Begin EM using reference haplotypes
[2020-03-20 10:29:25] Reference iteration = 1
[2020-03-20 10:31:55] Reference iteration = 2
[2020-03-20 10:34:24] Reference iteration = 3
[2020-03-20 10:39:53] Reference iteration = 4
[2020-03-20 10:42:32] Shuffle haplotypes - Iteration 4 - change on average 2015 intervals out of 5203 considered
[2020-03-20 10:42:32] Reference iteration = 5
[2020-03-20 10:45:01] Reference iteration = 6
[2020-03-20 10:47:32] Reference iteration = 7
[2020-03-20 10:53:02] Reference iteration = 8
[2020-03-20 10:55:39] Shuffle haplotypes - Iteration 8 - change on average 1842 intervals out of 5486 considered
[2020-03-20 10:55:39] Reference iteration = 9
[2020-03-20 10:58:10] Reference iteration = 10
[2020-03-20 11:00:41] Reference iteration = 11
[2020-03-20 11:06:16] Reference iteration = 12
[2020-03-20 11:08:52] Shuffle haplotypes - Iteration 12 - change on average 1547 intervals out of 5458 considered
[2020-03-20 11:08:52] Reference iteration = 13
[2020-03-20 11:11:21] Reference iteration = 14
[2020-03-20 11:13:53] Reference iteration = 15
[2020-03-20 11:19:36] Reference iteration = 16
[2020-03-20 11:22:12] Shuffle haplotypes - Iteration 16 - change on average 1401 intervals out of 5313 considered
[2020-03-20 11:22:12] Reference iteration = 17
[2020-03-20 11:24:41] Reference iteration = 18
[2020-03-20 11:27:09] Reference iteration = 19
[2020-03-20 11:29:37] Reference iteration = 20
[2020-03-20 11:32:07] Reference iteration = 21
[2020-03-20 11:34:37] Reference iteration = 22
[2020-03-20 11:37:07] Reference iteration = 23
[2020-03-20 11:39:38] Reference iteration = 24
[2020-03-20 11:42:07] Reference iteration = 25
[2020-03-20 11:44:37] Reference iteration = 26
[2020-03-20 11:47:07] Reference iteration = 27
[2020-03-20 11:49:38] Reference iteration = 28
[2020-03-20 11:52:07] Reference iteration = 29
[2020-03-20 11:54:37] Reference iteration = 30
[2020-03-20 11:57:07] Reference iteration = 31
[2020-03-20 11:59:36] Reference iteration = 32
[2020-03-20 12:02:06] Reference iteration = 33
[2020-03-20 12:04:36] Reference iteration = 34
[2020-03-20 12:07:05] Reference iteration = 35
[2020-03-20 12:09:36] Reference iteration = 36
[2020-03-20 12:12:05] Reference iteration = 37
[2020-03-20 12:14:36] Reference iteration = 38
[2020-03-20 12:17:05] Reference iteration = 39
[2020-03-20 12:19:35] Reference iteration = 40
[2020-03-20 12:22:07] Done EM using reference haplotypes
[2020-03-20 12:22:07] Done initializing paramters using reference haplotypes
[2020-03-20 12:22:07] Done parameter initialization
[2020-03-20 12:22:07] Start EM
[2020-03-20 12:22:07] Number of samples: 1
[2020-03-20 12:22:07] Number of SNPs: 2815985
[2020-03-20 12:22:07] Start of iteration 1
[2020-03-20 12:22:16] End EM
[2020-03-20 12:22:16] Begin making and writing output file
[2020-03-20 12:22:16] Determine reads in output blocks
[2020-03-20 12:22:17] Done determining reads in output blocks
[2020-03-20 12:22:17] Initialize output file
[2020-03-20 12:22:17] Done initializing output file
[2020-03-20 12:22:17] Loop over and write output file
[2020-03-20 12:22:17] Making output piece 1 / 276
[2020-03-20 12:22:38] Making output piece 32 / 276
[2020-03-20 12:22:57] Making output piece 62 / 276
[2020-03-20 12:23:18] Making output piece 93 / 276
[2020-03-20 12:23:38] Making output piece 123 / 276
[2020-03-20 12:23:59] Making output piece 154 / 276
[2020-03-20 12:24:18] Making output piece 184 / 276
[2020-03-20 12:24:39] Making output piece 215 / 276
[2020-03-20 12:24:59] Making output piece 245 / 276
[2020-03-20 12:25:19] Making output piece 276 / 276
[2020-03-20 12:25:20] Done looping over and writing output file
[2020-03-20 12:25:20] bgzip output file and move to final location
[2020-03-20 12:25:24] Done making and writing output file
[2020-03-20 12:25:51] Remove buffer region from output
[2020-03-20 12:25:52] Save RData objects to disk
[2020-03-20 12:26:19] Make metrics plot
[2020-03-20 12:28:22] Make estimated against real
[2020-03-20 12:28:45] Make other plots
[2020-03-20 12:29:54] Clean up and end
[2020-03-20 12:29:54] Program done

The run-time is noticeably longer with more variants, but I will post more information once I have comparisons with my other genomic data.
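For a rough sense of scale for those downsampling tests, mean coverage can be approximated as total sequenced bases over genome size. A back-of-the-envelope sketch (the read length and genome size here are assumptions for illustration, not values taken from the runs above):

```python
def approx_coverage(n_reads, read_length=150, genome_size=3.1e9):
    """Approximate mean depth: total sequenced bases / genome size."""
    return n_reads * read_length / genome_size

def downsample_fraction(current_coverage, target_coverage):
    """Fraction of reads to keep to move from current to target mean depth."""
    return target_coverage / current_coverage

# ~15,000 reads is orders of magnitude below the ~0.5x lcWGS data:
# approx_coverage(15000) comes out well under 0.001x for a human genome.
```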

cwarden45 commented 4 years ago

I am going to do some additional testing with downsampling (which was the original goal) as well as using extra reference samples. However, using 99 CEU reference samples gives performance similar to Gencove (even though it may currently be hard to see, since the maroon dot looks like a gray dot when it overlaps the partially transparent green region):

[Figure: estimated genotype recovery]

I will also provide an updated figure after I have completed those other comparisons. However, I think this looks good so far!
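The genotype recovery plotted above amounts to a concordance check between the imputed calls and a truth set. A minimal sketch of that comparison (the dict-based representation is an assumption for illustration; a real comparison would parse the two VCFs):

```python
def genotype_concordance(truth, imputed):
    """Fraction of truth-set sites with a matching imputed genotype.

    Both arguments map (chrom, pos) -> genotype coded 0/1/2
    (hom-ref / het / hom-alt)."""
    shared = [site for site in truth if site in imputed]
    if not shared:
        return float("nan")
    return sum(truth[s] == imputed[s] for s in shared) / len(shared)
```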