rwdavies / STITCH

STITCH - Sequencing To Imputation Through Constructing Haplotypes
http://www.nature.com/ng/journal/v48/n8/abs/ng.3594.html
GNU General Public License v3.0
77 stars 17 forks source link

Error in get_sample_names #41

Open SABiagini opened 4 years ago

SABiagini commented 4 years ago

Hi,

few months ago, I successfully used STITCH for imputing low coverage NIPT data. Now, I'm using it again with more samples (5069). I did not change my scripts or any setting in general and now when I run it I have this message:

Loading required package: parallel Loading required package: rrbgen [2020-09-23 21:21:30] Program start [2020-09-23 21:21:30] Get and validate pos and gen [2020-09-23 21:21:31] Done get and validate pos and gen [2020-09-23 21:21:31] There are 0 variants in the left buffer region -170332 <= position < 79668 [2020-09-23 21:21:31] There are 74081 variants in the central region 79668 <= position <= 4999861 [2020-09-23 21:21:31] There are 3700 variants in the right buffer region 4999861 < position <= 5249861 [2020-09-23 21:21:31] Get BAM sample names [2020-09-23 21:27:38] Done getting BAM sample names Error in get_sample_names(bamlist = bamlist, cramlist = cramlist, nCores = nCores, : There are repeat sample names Calls: STITCH -> get_sample_names Execution halted

I checked samples names over and over but there are no duplicates.

This is my script, it takes some inputs as arguments from a previous script that splits the posfile in windows and passes the information to separate runs of STITCH:

!/usr/bin/Rscript

library("STITCH")

args<-commandArgs(TRUE)

STITCH( chr=args[5], method="pseudoHaploid", bamlist="/ddn1/vol1/staging/leuven/stg_00048/STITCH/Isabelle/bamlist.txt", posfile=paste("/ddn1/vol1/staging/leuven/stg_00048/STITCH/Isabelle/posfile/","posfile_chr",args[5],".",args[4],sep=""), outputdir=paste("/ddn1/vol1/staging/leuven/stg00048/STITCH/Isabelle/","chr",args[5],".",args[4],".",args[1],"",args[2],".",args[3],sep=""), tempdir="/scratch/leuven/331/vsc33170/STITCH_temp/", regionStart = as.numeric(args[2]), regionEnd = as.numeric(args[3]), buffer = 250000, K=30, niterations=40, switchModelIteration=38, output_filename = paste("chr",args[5],".",args[4],".",args[1],".",args[2],".",args[3],".vcf.gz",sep=""), generateInputOnly = FALSE, regenerateInput = TRUE, inputBundleBlockSize = 74, nGen=8000, nCores=8)

I also tried to remove the new samples I added (so basically, I reran the same analysis I made few months ago) and this time I got this:

[2020-09-23 23:21:05] Program start [2020-09-23 23:21:05] Get and validate pos and gen [2020-09-23 23:21:06] Done get and validate pos and gen [2020-09-23 23:21:06] There are 0 variants in the left buffer region -170332 <= position < 79668 [2020-09-23 23:21:06] There are 74081 variants in the central region 79668 <= position <= 4999861 [2020-09-23 23:21:06] There are 3700 variants in the right buffer region 4999861 < position <= 5249861 [2020-09-23 23:21:06] Get BAM sample names [2020-09-23 23:21:27] Done getting BAM sample names [2020-09-23 23:21:27] Generate inputs Error in get_sampleReadsRaw_from_SeqLib(useSoftClippedBases = useSoftClippedBases, : GenomicRegion constructor: Failed to set region for 20:79068-5250446 Calls: STITCH ... loadBamAndConvert -> get_sampleReadsRaw_from_SeqLib Execution halted

Any hint on what the problem might be?

Thank you

rwdavies commented 4 years ago

Hmm can you take a look another way? What happens when you run the following. Here's an example on my machine using the provided test data where for the second instances I add 10 duplicate names and can see that. Can you modify to pass to your bam file and maybe adjust the nCores (I'm inpatient and have a big-ish server for this) and see if there are indeed duplicate names or not

> library("STITCH")                                                                                                                                                        
 > sample_names <- STITCH::get_sample_names(                                                                                                                                
 +     bamlist = "test-data/mouse_data/bamlist.txt",                                                                                                                        
 +     duplicate_name_behaviour = "warn",                                                                                                                                   
 +     nCores = 16,                                                                                                                                                         
 +     save = FALSE                                                                                                                                                         
 + )                                                                                                                                                                        
 [2020-09-25 11:32:41] Get BAM sample names
 [2020-09-25 11:32:42] Done getting BAM sample names
 > table(table(sample_names$bam_files))                                                                                                                                     

    1
 2073
!> bamlist2 <- tempfile()                                                                                                                                                   
 > system(paste0("cat test-data/mouse_data/bamlist.txt > ", bamlist2))                                                                                                      
 > system(paste0("head test-data/mouse_data/bamlist.txt >> ", bamlist2))                                                                                                    
 > library("STITCH")                                                                                                                                                        
 > sample_names <- STITCH::get_sample_names(                                                                                                                                
 +     bamlist = bamlist2,                                                                                                                                                  
 +     duplicate_name_behaviour = "warn",                                                                                                                                   
 +     nCores = 16,                                                                                                                                                         
 +     save = FALSE                                                                                                                                                         
 + )                                                                                                                                                                        
 [2020-09-25 11:32:43] Get BAM sample names
 [2020-09-25 11:32:43] Done getting BAM sample names
 [2020-09-25 11:32:43] Warning, there are repeat sample names
 > table(table(sample_names$bam_files))                                                                                                                                     

    1    2
 2063   10
 > 
SABiagini commented 4 years ago

Hi Robert,

Thank you for the help. Eventually, I made it work!

The issue related to the "repeat sample names" was solved removing 2 samples I identified as responsible of the problem. However, their names are not repeated, so I really don't understand why they were creating the problem.

Regarding the second issue, I found out it was related to the absence of the "chr" before the chromosome number in the posfile. Apparently, this was in conflict with the information in the bam files.

Before finding these solutions, I also tried to change the number of Cores but nothing changed.

I'm still wondering why those two samples were recognized as duplicates.

Does STITCH look at the file names or is it possible that the files have independent names but internally they are duplicates of other samples and STITCH can recognize it? Just an hypothesis..

Thank you very much for your help.

Best, S.