sapporo-wes / tataki

Command line tool for detecting life science data types.
Apache License 2.0

List the cases where tataki's result differs from our expectations #6

Open inutano opened 4 months ago

inutano commented 4 months ago

case: a file only with header lines

##fileformat=VCFv4.2
##nanopolish_window=MN908947.3:1-29902
##INFO=<ID=TotalReads,Number=1,Type=Integer,Description="The number of event-space reads used to call the variant">
##INFO=<ID=SupportFraction,Number=1,Type=Float,Description="The fraction of event-space reads that support the variant">
##INFO=<ID=SupportFractionByStrand,Number=2,Type=Float,Description="Fraction of event-space reads that support the variant for each strand">
##INFO=<ID=BaseCalledReadsWithVariant,Number=1,Type=Integer,Description="The number of base-space reads that support the variant">
##INFO=<ID=BaseCalledFraction,Number=1,Type=Float,Description="The fraction of base-space reads that support the variant">
##INFO=<ID=AlleleCount,Number=1,Type=Integer,Description="The inferred number of copies of the allele">
##INFO=<ID=StrandSupport,Number=4,Type=Integer,Description="Number of reads supporting the REF and ALT allele, by strand">
##INFO=<ID=StrandFisherTest,Number=1,Type=Integer,Description="Strand bias fisher test">
##INFO=<ID=SOR,Number=1,Type=Float,Description="StrandOddsRatio test from GATK">
##INFO=<ID=RefContext,Number=1,Type=String,Description="The reference sequence context surrounding the variant call">
##INFO=<ID=Pool,Number=1,Type=String,Description="The pool name">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

This file is detected as a BED file because it contains only `##` header lines and no data lines.
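One possible way to handle the header-only case, sketched in Python: the VCF specification requires `##fileformat=` to be the first line, so a check on that line alone can identify a VCF file even when no records follow. The function name `classify_header_only` is hypothetical and this is not how tataki itself is implemented.

```python
def classify_header_only(text: str) -> str:
    """Classify text by its first non-empty line (illustrative helper, not tataki's API)."""
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines before the first real line
        # The VCF spec requires the fileformat line to come first.
        return "vcf" if line.startswith("##fileformat=VCF") else "unknown"
    return "empty"


header_only = '##fileformat=VCFv4.2\n##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n'
print(classify_header_only(header_only))
```

Running such a check before the BED parser would keep a header-only file from falling through to BED, assuming the parser order can be adjusted that way.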

case: gzipped binary files

$ tataki tiny.bam.gz --yaml -v
[2024-07-11T07:26:40Z INFO  tataki::module] tataki started
[2024-07-11T07:26:40Z DEBUG tataki::module] Args: Args { input: ["tiny.bam.gz"], output: None, output_format: Csv, yaml: true, cache_dir: None, conf: None, tidy: false, no_decompress: false, num_records: 100000, dry_run: false, verbose: true, quiet: false }
[2024-07-11T07:26:40Z DEBUG tataki::module] Output format: Yaml
[2024-07-11T07:26:40Z INFO  tataki::module] Created temporary directory: /tmp/tataki_2024-0711-162640_BgiSCI
[2024-07-11T07:26:40Z INFO  tataki::module] Processing input: tiny.bam.gz
[2024-07-11T07:26:40Z DEBUG tataki::source] Provided input is in GZ format
Error: stream did not contain valid UTF-8

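The file is gzipped, but tataki (specifically the internal Rust GZ decoder) expects the decompressed content to be a plain-text file. The decompressed BAM data is binary, so decoding fails with the UTF-8 error above.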
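A minimal Python sketch of the alternative: decompress to raw bytes and only then decide whether the payload is text. The helper `decompress_and_peek` is hypothetical and only illustrates the idea. A real BAM file is itself BGZF-compressed, so one layer of gzip decompression still yields binary bytes.

```python
import gzip


def decompress_and_peek(data: bytes) -> tuple[str, bytes]:
    """Decompress gzip data as bytes and report whether the payload is valid UTF-8.

    Returns ("text" or "binary", first four bytes of the payload).
    """
    raw = gzip.decompress(data)  # bytes in, bytes out: no text decoding happens here
    try:
        raw.decode("utf-8")
        return "text", raw[:4]
    except UnicodeDecodeError:
        # Binary payload: hand it to binary parsers instead of failing.
        return "binary", raw[:4]


# Synthetic payload that starts with the BAM magic followed by non-UTF-8 bytes.
print(decompress_and_peek(gzip.compress(b"BAM\x01\xff\xfe\x00")))
```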

case: BGZF

$ tataki SAMPLE_01.pass.vcf.gz --yaml
[2024-07-11T07:36:49Z INFO  tataki::module] tataki started
[2024-07-11T07:36:49Z INFO  tataki::module] Created temporary directory: /tmp/tataki_2024-0711-163649_HL6qcv
[2024-07-11T07:36:49Z INFO  tataki::module] Processing input: SAMPLE_01.pass.vcf.gz
[2024-07-11T07:36:49Z INFO  tataki::parser] Invoking parser empty
[2024-07-11T07:36:49Z INFO  tataki::parser] Invoking parser bam
[2024-07-11T07:36:49Z INFO  tataki::parser] Invoking parser bcf
[2024-07-11T07:36:49Z INFO  tataki::parser] Invoking parser bed
[2024-07-11T07:36:49Z INFO  tataki::parser] Invoking parser cram
[2024-07-11T07:36:49Z INFO  tataki::parser] Invoking parser fasta
[2024-07-11T07:36:49Z INFO  tataki::parser] Invoking parser fastq
[2024-07-11T07:36:49Z INFO  tataki::parser] Invoking parser gff3
[2024-07-11T07:36:49Z INFO  tataki::parser] Invoking parser gtf
[2024-07-11T07:36:49Z INFO  tataki::parser] Invoking parser sam
[2024-07-11T07:36:49Z INFO  tataki::parser] Invoking parser vcf
[2024-07-11T07:36:49Z INFO  tataki::module] Detected!! vcf
[2024-07-11T07:36:49Z INFO  tataki::module] Deleting temporary directory: /tmp/tataki_2024-0711-163649_HL6qcv
SAMPLE_01.pass.vcf.gz:
  id: http://edamontology.org/format_3016
  label: VCF
  decompressed:
    label: null
    id: null

The file SAMPLE_01.pass.vcf.gz looks like a normal GZIP file, but it is actually a BGZF (Blocked GNU Zip Format) file. Because its content starts with a header that identifies it as VCF, tataki reports it as a plain VCF file.
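BGZF can be told apart from plain gzip by its header: per the SAM/BAM specification, each BGZF block is a gzip member with the FEXTRA flag set and an extra subfield tagged `BC` with a 2-byte payload. The sketch below checks for that; `is_bgzf` is a hypothetical name, not a tataki function.

```python
import struct


def is_bgzf(header: bytes) -> bool:
    """Return True if the bytes begin with a gzip member carrying the BGZF 'BC' subfield."""
    # gzip magic (1f 8b), deflate method (08), and the FEXTRA flag bit (0x04).
    if len(header) < 12 or header[:3] != b"\x1f\x8b\x08" or not header[3] & 0x04:
        return False
    (xlen,) = struct.unpack_from("<H", header, 10)  # length of the extra field
    extra = header[12:12 + xlen]
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if si1 == ord("B") and si2 == ord("C") and slen == 2:
            return True
        pos += 4 + slen  # move on to the next subfield
    return False


# The 28-byte BGZF end-of-file marker block from the SAM specification.
BGZF_EOF = bytes.fromhex("1f8b08040000000000ff0600424302001b0003000000000000000000")
print(is_bgzf(BGZF_EOF))
```

Only the first few bytes of the file are needed, so the check can run before any decompression.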

fmaccha commented 4 months ago

I will describe these in the README.

inutano commented 4 months ago

note: tataki may need to distinguish these formats: compressed BCF, uncompressed BCF, compressed VCF, uncompressed VCF
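A rough Python sketch of the four-way distinction, assuming it is enough to look at the gzip magic on the outside and the first bytes of the content on the inside (`BCF\2` for BCF2, `##fileformat=VCF` for VCF). The function `classify_variant_file` is hypothetical and does not reflect tataki's internals. It also does not separate BGZF from plain gzip.

```python
import gzip
import io


def classify_variant_file(data: bytes) -> str:
    """Label data as compressed/uncompressed BCF or VCF, or 'unknown'."""
    compressed = data[:2] == b"\x1f\x8b"  # gzip magic (also matches BGZF)
    if compressed:
        # Read only the first bytes of the decompressed stream.
        with gzip.GzipFile(fileobj=io.BytesIO(data)) as fh:
            head = fh.read(16)
    else:
        head = data[:16]
    if head.startswith(b"BCF\x02"):
        kind = "BCF"
    elif head.startswith(b"##fileformat=VCF"):
        kind = "VCF"
    else:
        return "unknown"
    return f"{'compressed' if compressed else 'uncompressed'} {kind}"


print(classify_variant_file(gzip.compress(b"##fileformat=VCFv4.2\n")))
```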