rwdavies / STITCH

STITCH - Sequencing To Imputation Through Constructing Haplotypes
http://www.nature.com/ng/journal/v48/n8/abs/ng.3594.html
GNU General Public License v3.0
77 stars 17 forks source link

Running STITCH with Only BAM Files #33

Closed cwarden45 closed 4 years ago

cwarden45 commented 4 years ago

Hi,

I have a different issue where I am trying to test creating imputations with a new .bam file and a table of genotypes from reference files.

However, I thought perhaps testing STITCH when I only provide .bam files (without "reference" parameters) may help me understand the other issue and/or it may help to have 2 sets of STITCH imputations to compare.

Here is the current command that I am trying to run:

bam_list = "28_1000_Genomes-plus-Nebula_provided-bams.txt"
name_list = "28_1000_Genomes-plus-Nebula_provided-names.txt"
output_folder = "STITCH-all_bam-Nebula_provided"

library("STITCH")
#from https://github.com/rwdavies/STITCH/blob/master/examples/examples.R
human_K = 10
human_nGen = 4 * 20000 / human_K
human_posfile = "../STITCH_Gencove/STITCH_human_example_2016_10_18/pos.txt"
temp_folder= "temp"

system(paste("mkdir ",temp_folder,sep=""))

STITCH(
  bamlist = bam_list,
  sampleNames_file = name_list,
  outputdir = output_folder,
  posfile = human_posfile,
  method = "diploid",
  regenerateInput = TRUE,
  regionStart = 1000000,
  regionEnd = 1100000,
  buffer = 10000,
  niterations = 1,
  chr = "20",
  inputBundleBlockSize = 100,
  shuffleHaplotypeIterations = NA,
  refillIterations = NA,
  K = human_K, tempdir = temp_folder, nCores = 1, nGen = human_nGen)

And this is the error message that I am getting:

Loading required package: parallel
Loading required package: rrbgen
mkdir: cannot create directory ‘temp’: File exists
[2020-03-14 15:22:07] Running STITCH(chr = 20, nGen = 8000, posfile = ../STITCH_Gencove/STITCH_human_example_2016_10_18/pos.txt, K = 10, S = 1, outputdir = STITCH-all_bam-Nebula_provided, nStarts = , tempdir = temp, bamlist = 28_1000_Genomes-plus-Nebula_provided-bams.txt, cramlist = , sampleNames_file = 28_1000_Genomes-plus-Nebula_provided-names.txt, reference = , genfile = , method = diploid, output_format = bgvcf, B_bit_prob = 16, outputInputInVCFFormat = FALSE, downsampleToCov = 50, downsampleFraction = 1, readAware = TRUE, chrStart = NA, chrEnd = NA, regionStart = 1e+06, regionEnd = 1100000, buffer = 10000, maxDifferenceBetweenReads = 1000, maxEmissionMatrixDifference = 1e+10, alphaMatThreshold = 1e-04, emissionThreshold = 1e-04, iSizeUpperLimit = 600, bqFilter = 17, niterations = 1, shuffleHaplotypeIterations = NA, splitReadIterations = 25, nCores = 1, expRate = 0.5, maxRate = 100, minRate = 0.1, Jmax = 1000, regenerateInput = TRUE, originalRegionName = NA, keepInterimFiles = FALSE, keepTempDir = FALSE, outputHaplotypeProbabilities = FALSE, switchModelIteration = NA, generateInputOnly = FALSE, restartIterations = NA, refillIterations = NA, downsampleSamples = 1, downsampleSamplesKeepList = NA, subsetSNPsfile = NA, useSoftClippedBases = FALSE, outputBlockSize = 1000, outputSNPBlockSize = 10000, inputBundleBlockSize = 100, genetic_map_file = , reference_haplotype_file = , reference_legend_file = , reference_sample_file = , reference_populations = NA, reference_phred = 20, reference_iterations = 40, reference_shuffleHaplotypeIterations = c(4, 8, 12, 16), output_filename = NULL, initial_min_hapProb = 0.2, initial_max_hapProb = 0.8, regenerateInputWithDefaultValues = FALSE, plotHapSumDuringIterations = FALSE, plot_shuffle_haplotype_attempts = FALSE, plotAfterImputation = TRUE, save_sampleReadsInfo = FALSE, gridWindowSize = NA, shuffle_bin_nSNPs = NULL, shuffle_bin_radius = 5000, keepSampleReadsInRAM = FALSE, useTempdirWhileWriting = FALSE, output_haplotype_dosages = FALSE)
[2020-03-14 15:22:07] Program start
[2020-03-14 15:22:07] Get and validate pos and gen
[2020-03-14 15:22:07] Done get and validate pos and gen
[2020-03-14 15:22:07] There are 89 variants in the left buffer region 990000 <= position < 1000000
[2020-03-14 15:22:07] There are 858 variants in the central region 1000000 <= position <= 1100000
[2020-03-14 15:22:07] There are 67 variants in the right buffer region 1100000 < position <= 1110000
[2020-03-14 15:22:07] Generate inputs
[2020-03-14 15:22:08] Done generating inputs
[2020-03-14 15:22:08] Copying files onto tempdir
sh: 1: cannot create temp/XTJVAJQMXH/files_to_transfer.txt: Directory nonexistent
rsync: failed to open files-from file temp/XTJVAJQMXH/files_to_transfer.txt: No such file or directory
rsync error: syntax or usage error (code 1) at main.c(1585) [client=3.1.2]
[2020-03-14 15:22:08] Done copying files onto tempdir
[2020-03-14 15:22:08] Generate allele count
Error in readChar(con, 5L, useBytes = TRUE) : cannot open the connection
Calls: STITCH ... get_sampleReads_from_dir_for_sample -> load -> readChar
In addition: Warning message:
In readChar(con, 5L, useBytes = TRUE) :
  cannot open compressed file 'temp/XTJVAJQMXH/bundledSamples.1-28.20.1000000.1100000.RData', probable reason 'No such file or directory'
Execution halted

I also have some additional information about the file preparation here, but I think this involves providing fewer details for troubleshooting that that other example.

Can you please help me troubleshoot this issue?

Thank You, Charles

cwarden45 commented 4 years ago

I found that I can resolve this particular issue with the following change:

temp_folder = tempdir()

For some reason, STITCH had a problem if I tried to define a "temp" subfolder in the same directory (which I could readily delete).

This causes the program to run without error messages.

I still have some questions (such as where to find position file that works best with STITCH across all chromosomes), but I will first test using a set of positions that I have compared with the 1000 Genomes Omni SNP chip. I also see a potentially useful function with the downsampleFraction parameter, but I will also create a separate issue if I start testing that and I encounter a problem.

rwdavies commented 4 years ago

This definitely seems to be a bug, I'm looking through it now

rwdavies commented 4 years ago

argh, github can't interpret "should" before "fix". we'll see if this passes tests

rwdavies commented 4 years ago

OK, this does indeed seem to be fixed now

cwarden45 commented 4 years ago

Thank you for following up. I am glad that you were able to find and fix the bug.