odelaneau / shapeit5

Segmented HAPlotype Estimation and Imputation Tool
https://odelaneau.github.io/shapeit5/
MIT License
61 stars 9 forks source link

phase_common error when output provided without a correct extension #48

Open freeseek opened 1 year ago

freeseek commented 1 year ago

If I don't include an extension in the output, I get some weird behavior from SHAPEIT5:

$ phase_common --input input.bcf --region chr1:0-100000 --output outp

[SHAPEIT5] phase_common (jointly phase multiple common markers)
  * Author        : Olivier DELANEAU, University of Lausanne
  * Contact       : olivier.delaneau@gmail.com
  * Version       : 5.1.1 / commit = 990ed0d / release = 2023-05-08
  * Run date      : 18/07/2023 - 12:00:00

Files:
  * Input         : [input.bcf]
  * Output        : [outp]
  * Output format : [bcf]

Parameters:
  * Seed    : 15052011
  * Threads : 1 threads
  * MCMC    : 15 iterations [5b + 1p + 1b + 1p + 1b + 1p + 5m]
  * PBWT    : [window = 4cM / depth = auto / modulo = auto / mac = 5 / missing = 0.1]
  * HMM     : [window = 4cM / Ne = 15000 / Constant recombination rate of 1cM per Mb]
...
...
...
Finalization:
  * HAP solving (0.03s)
  * HAP update (0.02s)
  * H2V transpose (0.00s)
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 18446744073709551615) > this->size() (which is 4)
Aborted (core dumped)

I believe that the problem arises from xcftools/common/src/utils/xcf.h line:

} else if (hts_fname.size() > 3 && hts_fname.substr(hts_fname.size()-5) == "vcf") {

where hts_fname.size()-5 underflows as the length of hts_fname is only 4 here

Similarly, I find suboptimal the behavior of SHAPEIT5 of throwing an error when not providing the correct extension:

$ phase_common --input input.bcf --region chr1:0-100000 --output output

[SHAPEIT5] phase_common (jointly phase multiple common markers)
  * Author        : Olivier DELANEAU, University of Lausanne
  * Contact       : olivier.delaneau@gmail.com
  * Version       : 5.1.1 / commit = 990ed0d / release = 2023-05-08
  * Run date      : 18/07/2023 - 12:00:00

Files:
  * Input         : [input.bcf]
  * Output        : [output]
  * Output format : [bcf]

Parameters:
  * Seed    : 15052011
  * Threads : 1 threads
  * MCMC    : 15 iterations [5b + 1p + 1b + 1p + 1b + 1p + 5m]
  * PBWT    : [window = 4cM / depth = auto / modulo = auto / mac = 5 / missing = 0.1]
  * HMM     : [window = 4cM / Ne = 15000 / Constant recombination rate of 1cM per Mb]
...
...
...
Finalization:
  * HAP solving (0.15s)
  * HAP update (0.06s)
  * H2V transpose (0.01s)

ERROR: Filename extension of [output] not recognized

Where the error comes from xcftools/common/src/utils/xcf.h which is called by shapeit5/phase_common/src/io/haplotype_writer.cpp. Ideally the error should be issued before the whole phasing computation is performed so that the user can correct the problem without delay. I also find it odd that an error is issued at all as it seems from the initial line Output format : [bcf] that SHAPEIT5 had decided that the output would be bcf regardless of the extension

Vorrid commented 1 year ago

I have the same problem as your 2nd example. I tried multiple output extensions but only .bcf extensions appear to be accepted, anything else provides the ERROR: Filename extension of [output.***] not recognized. I tried: .vcf.gz, .vcf, .xcf, and no extension at all (same as the example by freeseek). I tried with and without the --output-format [bh/bcf/vcf] command for .vcf extensions and only with that command for .xcf. Again, .bcf output does work.

A different problem i encountered - apologies for not making a new thread - was shapeit input problems where my files could not be read, resulting in an instantaneous 'segmentation fault' error. With a lot of trial and error I deduced this to originate from the input vcf file header lines that specify the chromosome length. I tested with and without the 'chr' at ID=1 but that didn't matter, then I changed the chr length. With the below header the phasing input failed (only for shapeit, no issues with bcftools at all) ##contig=<ID=1,length=249250621> - this corresponds to b37 chr1 size And with the below header it worked ##contig=<ID=1,length=248956422> - this corresponds to b38 chr1 size

In this case I started with a b37 file, used LiftOver to create b38 positions, but plink1.9 and plink 2.0 both saved the resulting vcf file with b37 chr lengths in the header. Again, BCFtools and plink had no issues with taking this vcf file as input, but Shapeit5.1.1 didn't take it as input. I suppose it detects some inconsistency with expected genome build based on the vcf header and the actual variant positions in the file? Perhaps this is purely caused by my inability but I figured I'd report it.

srubinacci commented 1 year ago

Thank you for reporting this.

I am slowly making quite some changes in the xcf code to include a very fast ligation, mainly targeted to refrence-based phasing (I recently added a concat naive option for the xcf format). Will keep your suggestions in mind and propagate them in the xcf.h class. As an additional note, I believe that we don't actually support .vcf anymore as they cannot be indexed (the htslib api would work reading them, but SHAPEIT/XCF/IMPUTE all require indexed files).

Meanwhile, please always use the bcf extension. Indeed, @Vorrid, I am unsure about the behavior you found. Can you confirm that .bcf won't work? Regarding your second problem, I am unsure what the source of the problem is. Are you files compressed and indexed?

Simone

srubinacci commented 1 year ago

@freeseek I should have fixed the problem in the latest commit of xcftools. https://github.com/odelaneau/xcftools/issues/1

Let me know if it all works as expected.