naumanjaved / fingerprint_maps

Create "haplotype maps" of variants for use with Picard's CrosscheckFingerprints utility, which allows for robust genotype-based identity checking of functional genomic data.

IOError followed by StopIteration errors #4

Closed fpbarthel closed 2 years ago

fpbarthel commented 6 years ago

I'm having some trouble running this. Originally I was getting the following error:

IOError: [Errno 2] No such file or directory: '.../output/1.log'

That error was resolved by creating the output directory (mkdir output).

Now I am getting the error:

tail: cannot open `.../intermediates/1.clumped' for reading: No such file or directory

When I create the directory (mkdir intermediates), the error does not go away. If I touch this file (touch intermediates/1.clumped), I get the error:

IOError: [Errno 2] No such file or directory: '.../intermediates/1.recode_u.vcf'

When I touch that too, I get a StopIteration error.

Any tips on how to run this? Also, is this run individually for each chromosome? The input format described here does not seem to be restricted to a single chromosome. Or is a single chromosome sufficient for fingerprinting?

fpbarthel commented 6 years ago

There seems to be a large block of commented-out code in map_building_functions.py here. Uncommenting this code results in the runtime error:

NameError: global name 'SNPs' is not defined

naumanjaved commented 6 years ago

Apologies for the delay in my response, and for the error. It seems I committed incorrect versions of the two scripts earlier, with some portions commented out. The current version should work now.

As a quick fix, you can resolve the second error

NameError: global name 'SNPs' is not defined

by uncommenting lines 36, 47, 48, and 65 in map_building_functions.py.

fpbarthel commented 6 years ago

Thanks! Do I need to run this individually for each chromosome and merge the resulting VCF files, or is a single chromosome sufficient for fingerprinting? The input VCF files (downloaded from the 1000 Genomes FTP site, as suggested) are separated by contig, and there is also a --chromosome parameter to the Python script.

naumanjaved commented 6 years ago

The script needs to be run on each chromosome individually. To get a final map file, merge the map files generated for each chromosome (or however many you want to use) and sort the result on columns 1 and 2. Then cat the provided header file to the beginning.
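A minimal sketch of that merge step in Python, assuming per-chromosome map files and a header file whose names here are purely illustrative (the repository's actual file names may differ):

```python
# Merge per-chromosome map files, sort on column 1 (chromosome) and
# column 2 (position), then prepend the provided header file.

def chrom_key(chrom):
    # Order numeric chromosomes numerically; others (X, Y, MT) sort after.
    name = chrom[3:] if chrom.startswith("chr") else chrom
    return (0, int(name), "") if name.isdigit() else (1, 0, name)

def merge_maps(map_paths, header_path, out_path):
    rows = []
    for path in map_paths:
        with open(path) as f:
            for line in f:
                if line.strip():
                    rows.append(line.rstrip("\n").split("\t"))
    # Sort by chromosome (column 1), then position (column 2, numeric).
    rows.sort(key=lambda r: (chrom_key(r[0]), int(r[1])))
    with open(out_path, "w") as out:
        with open(header_path) as header:
            out.write(header.read())
        for r in rows:
            out.write("\t".join(r) + "\n")
```

Note the sketch orders chromosomes numerically rather than lexicographically; whichever ordering you use should match the sequence dictionary of your reference.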

Whether or not a single chromosome is sufficient for fingerprinting depends on the characteristics of your input data (assay type, coverage, etc.). For ENCODE fingerprinting, we use all chromosomes.

fpbarthel commented 6 years ago

Thanks, that works. It does look like the first step, build.extract_similar_SNPs, takes a very long time, running over 70 hours for most chromosomes; is that normal? Also, the second step, build.create_VCFs, gives an error because the input VCF files have a .gz extension, so the vcftools invocation needs to be changed (--gzvcf instead of --vcf), but this is easily fixed. Did you uncompress the 1000 Genomes input VCFs (e.g. ALL.chr9.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz)?

naumanjaved commented 6 years ago

Hi,

Yes, the first step does take a very long time as the function iterates through every SNP in the VCF file. There are clearly some optimizations to be made here. I did uncompress the VCF files when I ran the script - I am glad that you were able to easily resolve the issue!

Nauman

fpbarthel commented 6 years ago

I was able to run the entire workflow successfully, but I ended up rewriting it as a Snakemake workflow that calls map_building_functions directly, which makes it a bit easier to run workflow components concurrently and to parallelize across chromosomes.

1. The first step is clearly the biggest bottleneck. Any plans to optimize the extract_similar_SNPs step? Is this even possible in Python?
2. Because of this bottleneck, a small number of chromosomes did not finish the first step within 7 days of non-stop running. Nevertheless, an (incomplete?) list of sites was output regardless. Is this potentially incomplete list safe to use? It passes through the subsequent steps without problems.
3. Is there any downside to excluding the X chromosome for fingerprinting with Picard/GATK? So far I am only using chromosomes 1-22.

hupef commented 6 years ago

Hi, for the first step, can I split a VCF, run extract_similar_SNPs on the pieces with multithreading, then cat the results and deduplicate the SNPs?

naumanjaved commented 6 years ago

Yes that should work.

extract_similar_SNPs will also run faster if the original VCFs are filtered (for bi-allelic, phased SNPs with the desired minimum allele fraction) before similar SNPs are extracted. This reduces the number of SNPs to iterate over. I will commit this change soon.
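The split-and-merge idea above could be sketched with a thread pool; here process_chunk is a stand-in for running extract_similar_SNPs on one VCF chunk (the function body and file layout are assumptions, not the repository's code):

```python
# Run a per-chunk extraction concurrently, then concatenate the results
# and deduplicate the SNPs while preserving order.
from multiprocessing.pool import ThreadPool

def process_chunk(chunk_path):
    # Stand-in worker: in the real workflow this would invoke
    # extract_similar_SNPs on chunk_path and return the SNP IDs found.
    with open(chunk_path) as f:
        return [line.strip() for line in f if line.strip()]

def parallel_extract(chunk_paths, workers=4):
    with ThreadPool(workers) as pool:
        per_chunk = pool.map(process_chunk, chunk_paths)
    # "cat" the per-chunk results, then deduplicate.
    seen = set()
    merged = []
    for snps in per_chunk:
        for snp in snps:
            if snp not in seen:
                seen.add(snp)
                merged.append(snp)
    return merged
```

Threads are shown because the work is dominated by I/O; for CPU-bound parsing, multiprocessing.Pool with the same structure would be the usual choice.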

fpbarthel commented 6 years ago

I noticed that the output file is opened and closed once per record, potentially several million times depending on the number of records in the VCF, while at the same time each record name is saved in memory (the SNPs variable).

See line 63 https://github.com/naumanjaved/fingerprint_maps/blob/master/map_building_functions.py#L63

I am not an expert in Python, but I figure this might cause some slowdown. I moved the open statement outside of the loop and then write the contents of the SNPs list in one pass. Running it now to see whether this improves the speed at all.
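A minimal before/after sketch of that change (function and variable names are illustrative, based on the description above):

```python
# Slow pattern: reopen and close the output file for every record.
def write_per_record(snps, out_path):
    with open(out_path, "w"):  # truncate so reruns start fresh
        pass
    for snp in snps:
        with open(out_path, "a") as out:  # open/close once per SNP
            out.write(snp + "\n")

# Faster pattern: open once, write all records, close once.
def write_batched(snps, out_path):
    with open(out_path, "w") as out:
        for snp in snps:
            out.write(snp + "\n")
```

Both produce identical output; the second simply avoids millions of open/close system calls.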

hupef commented 6 years ago

I want to know how to generate a new header file. I use ucsc.hg19.fasta as the reference, but I get the warning "Exception thrown in thread: Sequence dictionaries are not the same size (85, 25)" when running CrosscheckFingerprints. Are there any suggestions for solving this?

naumanjaved commented 6 years ago

I have a fork of Picard (https://github.com/naumanjaved/picard) in which I have relaxed the header/dictionary size requirements. You can try using it and see whether you are able to process your BAMs.
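To see where a size mismatch like (85, 25) comes from, one can count the @SQ lines in the haplotype map's header block and in the reference sequence dictionary; a rough diagnostic sketch (file paths are assumptions):

```python
# Count @SQ (sequence dictionary) entries in a SAM-style header, such as
# the haplotype map's header block or a reference .dict file. The error
# above arises when the two dictionaries contain different numbers of
# contigs (e.g. a reference with alternate contigs vs. a 25-contig map).
def count_sq_lines(path):
    with open(path) as f:
        return sum(1 for line in f if line.startswith("@SQ"))

def dictionaries_match(map_header_path, ref_dict_path):
    return count_sq_lines(map_header_path) == count_sq_lines(ref_dict_path)
```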