secastel / phaser

phasing and Allele Specific Expression from RNA-seq
GNU General Public License v3.0
105 stars 36 forks source link

Additional Commits #47

Closed everestial closed 5 years ago

everestial commented 6 years ago

Method added to merge multiple files by chromosome to one.

secastel commented 6 years ago

I've had the time to look over the changes, and I don't think I'm comfortable including them in the master branch of phASER. I think for the majority of users a better solution is to use e.g. bash scripting to run phASER per chromosome and then merge the resulting output files, rather than modifying the code of phASER itself. Furthermore, there are issues with splitting chromsomes into chunks and running phASER on those chunks, as is done in this code. First, parameters are estimated which are used during read backed phasing, that when split, are only being estimated on a part of the data, and then will differ even on the same chromosome. Second, any breaks in a chromosome mean that variants cannot be phased across those breaks. While this is likely a small number of cases, it is still an issue nonetheless. In light of this, and other issues related to ease of future development, as I said, I think I will keep this functionality out of the master phASER branch. An alternative would be to adapt the changes you've made here as a standalone script, which calls the unchanged phaser.py script for each of the detected chromosomes, and then merges their output. I think this would be a much more flexible approach, and I'd be happy to include that code in the master branch of phASER. If you don't feel that is worthwhile, then by all means, continue using this code for your own purposes and maintaining it on your branch. Thank you again for all the effort you've put into this, and I hope it gets phASER working well for your purposes.

everestial commented 6 years ago

@secastel This is quite a long reply to you. But, I am hopeful it is more helpful than not.

I've had the time to look over the changes, and I don't think I'm comfortable including them in the master branch of phASER. I think for the majority of users a better solution is to use e.g. bash scripting to run phASER per chromosome and then merge the resulting output files, rather than modifying the code of phASER itself.

Coming from a biology background I feel that majority of the emperical biologist aren't even comfortable at adding extra bash script with a for loop for each chromosome and sample. So, I thought it would be helpful to put the code right inside phaser.

In addition I just reviewed my codes and results and have to say that the proposed code changes are in best interest of the phaser software itself. And, the code changes are verified with the detailed results.

See the details below.

Furthermore, there are issues with splitting chromsomes into chunks and running phASER on those chunks, as is done in this code. First, parameters are estimated which are used during read backed phasing, that when split, are only being estimated on a part of the data, and then will differ even on the same chromosome.

I wish I was wrong on this but I am not. I just verified the code changes, results it produced (using my code changes) and compared it with the results from phaser (original) in per chromosome mode.

Tip 01: This code in the original phaser:

if args.chr != "":
    decomp_str = "tabix -h "+args.vcf+" "+args.chr+":"
else:
    decomp_str = "gunzip -c "+args.vcf;

actually strips offs all the vcf sites from other contigs if --chr mode is activated. In that case the statistics it calculates i.e number of heterozygote sites, number of indels, number of unphased sites, alignment score cutoff, retrieved reads, sequence noise level, number of variants covered by at least 1 read, phased variants of all variants, etc. are only calculated for that chromosome. But, when you have the --chr not active then it will compute the above statistics in the global mode.

Conclusion: Running phaser in per chromosome mode is going to produce the stats that is little different anyway - which was your concern.


Second, any breaks in a chromosome mean that variants cannot be phased across those breaks. While this is likely a small number of cases, it is still an issue nonetheless. In light of this, and other issues related to ease of future development, as I said, I think I will keep this functionality out of the master phASER branch.

Splitting VCF by chromosome, doesn't create a break with in chromosome. If you can elaborate on it.


Tip 02: So, what I did was just replicated the above process in memory efficient mode. My code changes produces same exact result that it would have produced if phaser (original) was run in per chromosome mode. See the verification below,

Orginal phaser.py in per chromosome mode:

for chromosome 2


$ python phaserOrg.py --bam realigned_ms01e.chr2n3.bam --vcf phased.MySpF1.chr2n3.vcf.gz --sample ms01e --o ResultOrgPhaser/ms01e_chr2 --mapq 40 --baseq 10 --paired_end 1 --id_separator - --pass_only 0 --chr 2

##################################################
              Welcome to phASER v1.0.1
  Author: Stephane Castel (scastel@nygenome.org)
##################################################

#1. Loading heterozygous variants into intervals...
     restricting to chromosome '2'...
     processing VCF...
     creating variant mapping table...
          23840 heterozygous sites being used for phasing (0 filtered, 1898 indels excluded, 25738 unphased)
#2. Retrieving reads that overlap heterozygous sites...
     file: realigned_ms01e.chr2n3.bam
          minimum mapq: 40
          mapping reads to variants...
               completed chromosome 2...
          processing mapped reads...
          using alignment score cutoff of 157
          retrieved 1120185 reads
#3. Identifying connected variants...
     calculating sequencing noise level...
     sequencing noise level estimated at 0.000445
     creating read sets...
     generating read connectivity map...
     testing variant connections versus noise...
     2109 variant connections dropped because of conflicting configurations (threshold = 0.010000)
     23413 variants covered by at least 1 read
#4. Identifying haplotype blocks...
#5. Phasing blocks...
#6. Outputting haplotypes...
#7. Outputting phased VCF...
     GT field is not being updated with phASER genome wide phase. This can be changed using the --gw_phase_vcf argument.
     Compressing and tabix indexing output VCF...
COMPLETED using 1120185 reads in 69 seconds using 1 threads
     PHASED  22679 of 23840 all variants (= 0.951300) with at least one other variant
     GENOME WIDE PHASED  0 of 25738 unphased variants (= 0.000000)
     GENOME WIDE PHASE CORRECTED  0 of 23840 variants (= 0.000000)
     Global maximum memory usage: 515.45 (mb)

for chromosome 3

$ python phaserOrg.py --bam realigned_ms01e.chr2n3.bam --vcf phased.MySpF1.chr2n3.vcf.gz --sample ms01e --o ResultOrgPhaser/ms01e_chr3 --mapq 40 --baseq 10 --paired_end 1 --id_separator - --pass_only 0 --chr 3

################################################## Welcome to phASER v1.0.1 Author: Stephane Castel (scastel@nygenome.org) ##################################################

1. Loading heterozygous variants into intervals...

 restricting to chromosome '3'...
 processing VCF...
 creating variant mapping table...
      32091 heterozygous sites being used for phasing (0 filtered, 2594 indels excluded, 34685 unphased)

2. Retrieving reads that overlap heterozygous sites...

 file: realigned_ms01e.chr2n3.bam
      minimum mapq: 40
      mapping reads to variants...
           completed chromosome 3...
      processing mapped reads...
      using alignment score cutoff of 159
      retrieved 1749129 reads

3. Identifying connected variants...

 calculating sequencing noise level...
 sequencing noise level estimated at 0.000471
 creating read sets...
 generating read connectivity map...
 testing variant connections versus noise...
 2038 variant connections dropped because of conflicting configurations (threshold = 0.010000)
 31483 variants covered by at least 1 read

4. Identifying haplotype blocks...

5. Phasing blocks...

6. Outputting haplotypes...

7. Outputting phased VCF...

 GT field is not being updated with phASER genome wide phase. This can be changed using the --gw_phase_vcf argument.
 Compressing and tabix indexing output VCF...

COMPLETED using 1749129 reads in 90 seconds using 1 threads PHASED 30499 of 32091 all variants (= 0.950391) with at least one other variant GENOME WIDE PHASED 0 of 34685 unphased variants (= 0.000000) GENOME WIDE PHASE CORRECTED 0 of 32091 variants (= 0.000000) Global maximum memory usage: 721.70 (mb)


Now, see the comparison of the above stats with the stats produced in memory efficient mode. You will see the stats produced are exactly the same. You may compare the output results (VCF files and others).

$ python phaser.py --bam realigned_ms01e.chr2n3.bam --vcf phased.MySpF1.chr2n3.vcf.gz --sample ms01e --o ResultUpdatedPhaser/ms01e --mapq 40 --baseq 10 --paired_end 1 --id_separator - --pass_only 0 --process_slow 1

##################################################
              Welcome to phASER v1.0.1
  Author: Stephane Castel (scastel@nygenome.org)
##################################################

#1. Loading heterozygous variants into intervals...
     using all the chromosomes ...
     processing VCF...

     Processing the readbackphasing in memory efficient mode.
     Splitting the VCF by chromosome/contigs. 

     creating variant mapping table...
          23840 heterozygous sites being used for phasing (0 filtered, 1898 indels excluded, 25738 unphased)
#2. Retrieving reads that overlap heterozygous sites...
     file: realigned_ms01e.chr2n3.bam
          minimum mapq: 40
          mapping reads to variants...
               completed chromosome 2...
          processing mapped reads...
          using alignment score cutoff of 157
          retrieved 1120185 reads
#3. Identifying connected variants...
     calculating sequencing noise level...
     sequencing noise level estimated at 0.000445
     creating read sets...
     generating read connectivity map...
     testing variant connections versus noise...
     2109 variant connections dropped because of conflicting configurations (threshold = 0.010000)
     23413 variants covered by at least 1 read
#4. Identifying haplotype blocks...
#5. Phasing blocks...
#6. Outputting haplotypes...
#7. Outputting phased VCF...
     GT field is not being updated with phASER genome wide phase. This can be changed using the --gw_phase_vcf argument.
     Compressing and tabix indexing output VCF...
COMPLETED using 1120185 reads in 82 seconds using 1 threads
     PHASED  22679 of 23840 all variants (= 0.951300) with at least one other variant
     GENOME WIDE PHASED  0 of 25738 unphased variants (= 0.000000)
     GENOME WIDE PHASE CORRECTED  0 of 23840 variants (= 0.000000)
     Global maximum memory usage: 515.84 (mb)

     creating variant mapping table...
          32091 heterozygous sites being used for phasing (0 filtered, 2594 indels excluded, 34685 unphased)
#2. Retrieving reads that overlap heterozygous sites...
     file: realigned_ms01e.chr2n3.bam
          minimum mapq: 40
          mapping reads to variants...
               completed chromosome 3...
          processing mapped reads...
          using alignment score cutoff of 159
          retrieved 1749129 reads
#3. Identifying connected variants...
     calculating sequencing noise level...
     sequencing noise level estimated at 0.000471
     creating read sets...
     generating read connectivity map...
     testing variant connections versus noise...
     2038 variant connections dropped because of conflicting configurations (threshold = 0.010000)
     31483 variants covered by at least 1 read
#4. Identifying haplotype blocks...
#5. Phasing blocks...
#6. Outputting haplotypes...
#7. Outputting phased VCF...
     GT field is not being updated with phASER genome wide phase. This can be changed using the --gw_phase_vcf argument.
     Compressing and tabix indexing output VCF...
COMPLETED using 1749129 reads in 174 seconds using 1 threads
     PHASED  30499 of 32091 all variants (= 0.950391) with at least one other variant
     GENOME WIDE PHASED  0 of 34685 unphased variants (= 0.000000)
     GENOME WIDE PHASE CORRECTED  0 of 32091 variants (= 0.000000)
     Global maximum memory usage: 741.35 (mb)

#8. Merging the results from several contigs/chromosome ...

Over all conclusion:


An alternative would be to adapt the changes you've made here as a standalone script, which calls the unchanged phaser.py script for each of the detected chromosomes, and then merges their output. I think this would be a much more flexible approach, and I'd be happy to include that code in the master branch of phASER. If you don't feel that is worthwhile, then by all means, continue using this code for your own purposes and maintaining it on your branch. Thank you again for all the effort you've put into this, and I hope it gets phASER working well for your purposes.

I will let you decide how to make phaser software more adaptive. I am totally fine with how you want to accept and put the changes, but making less work to an empirical biologist would be helpful. If there are better way to split the VCF and/or BAM files that could be added too.


PS:

Let me know if you want to discuss further !

secastel commented 5 years ago

Thanks for the detailed response here - I've been very busy over the summer so haven't gotten the chance to look over this - I apologize. Perhaps I misunderstood the changes - I thought that with your code a chromosome could possible get split into chunks, which shouldn't happen. If this it is the case that this will never happen, then this should be okay. I tried to test out the code, and ran into an issue however. It seems that when the split result VCFs are merged, they are not done so in a way that keeps the chromosome order properly sorted.

8. Merging the results from several contigs/chromosome ...

[ti_index_core] the chromosome blocks not continuous at line 1983169, is the file sorted? [pos 61334] Traceback (most recent call last): File "phaser/phaser/phaser_slow.py", line 2284, in main(); File "phaser/phaser/phaser_slow.py", line 338, in main subprocess.check_call(tabix_cmd, shell=True, executable='/bin/bash') File "/usr/lib64/python2.7/subprocess.py", line 186, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command 'tabix -f -p vcf test_results/NA06986.vcf.gz' returned non-zero exit status 1

Would you be able to update your code so that this is the case? The order of merging the split VCFs should be equivalent to the tool sort, for example:

ls SplitVCFNA06986/ | sort 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 X

If you're able to do this then I can test out the proposed changed code and make sure it produces the same results.

everestial commented 5 years ago

@secastel I only had a chance to look back at this issue. I can definitely make the code changes to incorporate the proper sorting. Thanks,

everestial commented 5 years ago

HI @secastel I have updated phaser with following changes:

- added time, memory tracker
- upgraded with "--process_slow " flag to process each contig/chromosome separately under memory efficient mode 
- added code changes to write "PI" value (in VCF file's FORMAT field) in increasing order.
- added code changes to write haplotype and other text files in sorted order (by genomic position).
- While under "--process slow = 1" mode, it merges the VCFs (for each contig/chromosome) and sorts the final VCF using bcftools, which should bring the required order back. 

I have updated this branch with the final tested code. The test comparison is shown below. The updated python code file and input test files along with final outputs are available via dropbox. Note: The final updated phaser is available as file "phaserV3.py" in the shared dropbox folder - https://www.dropbox.com/sh/mlvoegw1hqq5g5k/AADhYUFHLpjNxz3zVjVUZCrxa?dl=0

Below is the comparison of how the original phaser (your latest version) compares to the code changes in this branch.

Result comparison from several "phaser" versions

1) Latest original phaser as available under https://github.com/secastel/phaser/blob/master/phaser/phaser.py

$ python phaserOrg2018.py --bam realigned_ms01e.chr2n3.bam --vcf phased.MySpF1.chr2n3.vcf.gz --sample ms01e --o ResultFromOrgPhaser/ms01e --mapq 40 --baseq 10 --paired_end 1 --id_separator - --pass_only 0

##################################################
              Welcome to phASER v1.0.1
  Author: Stephane Castel (scastel@nygenome.org)
##################################################

#1. Loading heterozygous variants into intervals...
     processing VCF...
     creating variant mapping table...
          55931 heterozygous sites being used for phasing (0 filtered, 4492 indels excluded, 60423 unphased)
#2. Retrieving reads that overlap heterozygous sites...
     file: realigned_ms01e.chr2n3.bam
          minimum mapq: 40
          mapping reads to variants...
               completed chromosome 3...
               completed chromosome 2...
          processing mapped reads...
          using alignment score cutoff of 158
          retrieved 2870876 reads
#3. Identifying connected variants...
     calculating sequencing noise level...
     sequencing noise level estimated at 0.000462
     creating read sets...
     generating read connectivity map...
     testing variant connections versus noise...
     4136 variant connections dropped because of conflicting configurations (threshold = 0.010000)
     54890 variants covered by at least 1 read
#4. Identifying haplotype blocks...
#5. Phasing blocks...
#6. Outputting haplotypes...
#7. Outputting phased VCF...
     GT field is not being updated with phASER genome wide phase. This can be changed using the --gw_phase_vcf argument.
     Compressing and tabix indexing output VCF...
COMPLETED using 2870876 reads in 165 seconds using 1 threads
     PHASED  53185 of 55931 all variants (= 0.950904) with at least one other variant
     GENOME WIDE PHASED  0 of 60423 unphased variants (= 0.000000)
     GENOME WIDE PHASE CORRECTED  0 of 55931 variants (= 0.000000)


2) Original phaser with memory and time tracker embeded

$ python phaserOrgWithTimeNMemTracker.py --bam realigned_ms01e.chr2n3.bam --vcf phased.MySpF1.chr2n3.vcf.gz --sample ms01e --o ResultFromOrgPhaser02/ms01e --mapq 40 --baseq 10 --paired_end 1 --id_separator - --pass_only 0

##################################################
              Welcome to phASER v1.0.1
  Author: Stephane Castel (scastel@nygenome.org)
##################################################

DATE, TIME : 2018-11-30, 18:47:31

#1. Loading heterozygous variants into intervals...
     processing VCF...
     creating variant mapping table...
          55931 heterozygous sites being used for phasing (0 filtered, 4492 indels excluded, 60423 unphased)
#2. Retrieving reads that overlap heterozygous sites...
     file: realigned_ms01e.chr2n3.bam
          minimum mapq: 40
          mapping reads to variants...
               completed chromosome 3...
               completed chromosome 2...
          processing mapped reads...
          using alignment score cutoff of 158
          retrieved 2870876 reads
#3. Identifying connected variants...
     calculating sequencing noise level...
     sequencing noise level estimated at 0.000462
     creating read sets...
     generating read connectivity map...
     testing variant connections versus noise...
     4136 variant connections dropped because of conflicting configurations (threshold = 0.010000)
     54890 variants covered by at least 1 read
#4. Identifying haplotype blocks...
#5. Phasing blocks...
#6. Outputting haplotypes...
#7. Outputting phased VCF...
     GT field is not being updated with phASER genome wide phase. This can be changed using the --gw_phase_vcf argument.
     Compressing and tabix indexing output VCF...
COMPLETED using 2870876 reads in 165 seconds using 1 threads
     PHASED  53185 of 55931 all variants (= 0.950904) with at least one other variant
     GENOME WIDE PHASED  0 of 60423 unphased variants (= 0.000000)
     GENOME WIDE PHASE CORRECTED  0 of 55931 variants (= 0.000000)
     Global maximum memory usage: 1146.10 (mb)

COMPLETED "Read backed phasing" of sample ms01e in 00:02:45 hh:mm:ss
DATE, TIME : 2018-11-30, 18:50:16


3) phaser (phaserV3.py) with multiple features upgraded

- added time, memory tracker
- upgraded with "--process_slow " flag to process each contig/chromosome separately under memory efficient mode 
- added code changes to write "PI" value (in VCF file's FORMAT field) in increasing order.
- added code changes to write haplotype and other text files in sorted order (by genomic position).
- While under "--process slow = 1" mode, it merges the VCFs (for each contig/chromosome) and sorts the final VCF using bcftools, which should bring the required order back. 


3-A) phaseV3 without memory efficient mode active

$ python phaserV3.py --bam realigned_ms01e.chr2n3.bam --vcf phased.MySpF1.chr2n3.vcf.gz --sample ms01e --o ResultFromUpdatedPhaserNonME/ms01e --mapq 40 --baseq 10 --paired_end 1 --id_separator - --pass_only 0

##################################################
              Welcome to phASER v1.0.1
  Author: Stephane Castel (scastel@nygenome.org)
  Updated by: Bishwa K. Giri (bkgiri@uncg.edu)
##################################################

Completed the check of dependencies and input files availability... 

STARTED "Read backed phasing and ASE/haplotype analyses" ... 
    DATE, TIME : 2018-11-30, 18:36:31
#1. Loading heterozygous variants into intervals...
Processing sample named ms01e
    using all the chromosomes ...
    processing VCF...

    Memory efficient mode is deactivated...
    If RAM is limited, activate memory efficient mode using the flag "--process_slow = 1"...

     creating variant mapping table...
          55931 heterozygous sites being used for phasing (0 filtered, 4492 indels excluded, 60423 unphased)

#2. Retrieving reads that overlap heterozygous sites...
     file: realigned_ms01e.chr2n3.bam
          minimum mapq: 40
          mapping reads to variants...
               completed chromosome 2...
               completed chromosome 3...
          processing mapped reads...
          using alignment score cutoff of 158
          retrieved 2870876 reads
#3. Identifying connected variants...
     calculating sequencing noise level...
     sequencing noise level estimated at 0.000462
     creating read sets...
     generating read connectivity map...
     testing variant connections versus noise...
     4136 variant connections dropped because of conflicting configurations (threshold = 0.010000)
     54890 variants covered by at least 1 read
#4. Identifying haplotype blocks...
#5. Phasing blocks...
#6. Outputting haplotypes...
#7. Outputting phased VCF...
     GT field is not being updated with phASER genome wide phase. This can be changed using the --gw_phase_vcf argument.
     Compressing and tabix indexing output VCF...

     COMPLETED using 2870876 reads in 208 seconds using 1 threads
     PHASED  53185 of 55931 all variants (= 0.950904) with at least one other variant
     GENOME WIDE PHASED  0 of 60423 unphased variants (= 0.000000)
     GENOME WIDE PHASE CORRECTED  0 of 55931 variants (= 0.000000)
     Global maximum memory usage: 1668.02 (mb)

COMPLETED "Read backed phasing" of sample ms01e in 00:03:30 hh:mm:ss
DATE, TIME : 2018-11-30, 18:40:02

The End.


3-B) phaserV3 with "memory efficient" mode active (i.e "--process_slow 1")

$ python phaserV3.py --bam realigned_ms01e.chr2n3.bam --vcf phased.MySpF1.chr2n3.vcf.gz --sample ms01e --o ResultFromUpdatedPhaserMemEfficient/ms01e --mapq 40 --baseq 10 --paired_end 1 --id_separator - --pass_only 0 --process_slow 1

##################################################
              Welcome to phASER v1.0.1
  Author: Stephane Castel (scastel@nygenome.org)
  Updated by: Bishwa K. Giri (bkgiri@uncg.edu)
##################################################

Completed the check of dependencies and input files availability... 

STARTED "Read backed phasing and ASE/haplotype analyses" ... 
    DATE, TIME : 2018-11-30, 18:41:10
#1. Loading heterozygous variants into intervals...
Processing sample named ms01e
    using all the chromosomes ...
    processing VCF...

    Memory efficient mode is activated... 
    2 unique contigs/chromosomes found... 
    Running processes for each chromosome separately...

processing chromosome '2' ...
     creating variant mapping table...
          23840 heterozygous sites being used for phasing (0 filtered, 1898 indels excluded, 25738 unphased)

#2. Retrieving reads that overlap heterozygous sites...
     file: realigned_ms01e.chr2n3.bam
          minimum mapq: 40
          mapping reads to variants...
               completed chromosome 2...
          processing mapped reads...
          using alignment score cutoff of 157
          retrieved 1120185 reads
#3. Identifying connected variants...
     calculating sequencing noise level...
     sequencing noise level estimated at 0.000445
     creating read sets...
     generating read connectivity map...
     testing variant connections versus noise...
     2109 variant connections dropped because of conflicting configurations (threshold = 0.010000)
     23413 variants covered by at least 1 read
#4. Identifying haplotype blocks...
#5. Phasing blocks...
#6. Outputting haplotypes...
#7. Outputting phased VCF...
     GT field is not being updated with phASER genome wide phase. This can be changed using the --gw_phase_vcf argument.
     Compressing and tabix indexing output VCF...

     COMPLETED using 1120185 reads in 82 seconds using 1 threads
     PHASED  22679 of 23840 all variants (= 0.951300) with at least one other variant
     GENOME WIDE PHASED  0 of 25738 unphased variants (= 0.000000)
     GENOME WIDE PHASE CORRECTED  0 of 23840 variants (= 0.000000)
     Global maximum memory usage: 747.16 (mb)
     Completed processes for contig/chromosome "2" in 00:01:22 hh:mm:ss

processing chromosome '3' ...
     creating variant mapping table...
          32091 heterozygous sites being used for phasing (0 filtered, 2594 indels excluded, 34685 unphased)

#2. Retrieving reads that overlap heterozygous sites...
     file: realigned_ms01e.chr2n3.bam
          minimum mapq: 40
          mapping reads to variants...
               completed chromosome 3...
          processing mapped reads...
          using alignment score cutoff of 159
          retrieved 1749129 reads
#3. Identifying connected variants...
     calculating sequencing noise level...
     sequencing noise level estimated at 0.000471
     creating read sets...
     generating read connectivity map...
     testing variant connections versus noise...
     2038 variant connections dropped because of conflicting configurations (threshold = 0.010000)
     31483 variants covered by at least 1 read
#4. Identifying haplotype blocks...
#5. Phasing blocks...
#6. Outputting haplotypes...
#7. Outputting phased VCF...
     GT field is not being updated with phASER genome wide phase. This can be changed using the --gw_phase_vcf argument.
     Compressing and tabix indexing output VCF...

     COMPLETED using 1749129 reads in 120 seconds using 1 threads
     PHASED  30499 of 32091 all variants (= 0.950391) with at least one other variant
     GENOME WIDE PHASED  0 of 34685 unphased variants (= 0.000000)
     GENOME WIDE PHASE CORRECTED  0 of 32091 variants (= 0.000000)
     Global maximum memory usage: 1063.06 (mb)
     Completed processes for contig/chromosome "3" in 00:02:00 hh:mm:ss

#8. Merging the results from several contigs/chromosome ...
    - Merging splitted text files *.haplotypes.txt into one file for the given sample "ms01e"
    - Merging splitted text files *.haplotypic_counts.txt into one file for the given sample "ms01e"
    - Concatenating splitted VCFs for sample "ms01e"
Writing to /tmp/bcftools-sort.qtpzJM
Merging 1 temporary files
Cleaning
Done
    - Merging splitted text files *.variant_connections.txt into one file for the given sample "ms01e"
    - Merging splitted text files *.allele_config.txt into one file for the given sample "ms01e"
    - Merging splitted text files *.allelic_counts.txt into one file for the given sample "ms01e"

COMPLETED "Read backed phasing" of sample ms01e in 00:03:37 hh:mm:ss
DATE, TIME : 2018-11-30, 18:44:47

The End.

If the issues still arises in your test file, let me know if you can share it. I think this application is awesome and can be improved further.

Let me know if you have any questions.

everestial commented 5 years ago

I have made some more changes (mainly error checks) on phaser. There are also other changes that can be added after this merge. If you are able to merge the latest update, let me know if you would like them be discussed.

secastel commented 5 years ago

Hey there, thanks for all this work! I ran your new version through my test case and without process_slow it produces the same output. When I use process_slow it produces a different output. I believe the reason for this is that in process_slow mode the sequencing noise estimate is generated per chromosome, instead of using all chromosomes. This parameter is used to decide at what threshold data from variants that have conflicting read-based phase information are discarded. Under ideal circumstances this parameter should be estimated using all chromosomes (as is done with process_slow turned off), however I don't see any way to do that in the current implementation of process slow. I'm okay to include your modifications but there needs to be a warning message that the results will be different, and explaining why. I can add this as a message in stdout when the option is enabled. What do you think?

everestial commented 5 years ago

Hi, Yeah, you can go ahead and accept the commits. Adding warning message to the std output is indeed required and helpful.

Thanks,

secastel commented 5 years ago

Done, thanks again!