hoffmangroup / genomedata

The Genomedata format for storing large-scale functional genomics data.
https://genomedata.hoffmanlab.org/
GNU General Public License v2.0

load_genomedata: responds poorly to invalid syntax [only a track path, instead of (name, path) tuple] #33

Open EricR86 opened 7 years ago

EricR86 commented 7 years ago

Original report (archived issue) by Coby Viner (Bitbucket: cviner2, GitHub: cviner).


load_genomedata neither fails fast nor returns a clear error message when a track path is provided directly (tracks=['./5xC-sorted.bedGraph.gz']), as opposed to correctly providing a track name and a (file or directory) path as a tuple (tracks=('5xC', './5xC-sorted.bedGraph.gz')).

An invocation of load_genomedata that results in this issue is provided below.

load_genomedata.load_genomedata('./testArchive', tracks=['./5xC-sorted.bedGraph.gz'],
seqfilenames=['/mnt/work1/data/genomes/human/hg19/iGenomes/Sequence/Chromosomes/chrY.fa',
 '/mnt/work1/data/genomes/human/hg19/iGenomes/Sequence/Chromosomes/chr21.fa',
 '/mnt/work1/data/genomes/human/hg19/iGenomes/Sequence/Chromosomes/chr5.fa',
 '/mnt/work1/data/genomes/human/hg19/iGenomes/Sequence/Chromosomes/chr3.fa',
 '/mnt/work1/data/genomes/human/hg19/iGenomes/Sequence/Chromosomes/chr2.fa',
 '/mnt/work1/data/genomes/human/hg19/iGenomes/Sequence/Chromosomes/chr6.fa',
 '/mnt/work1/data/genomes/human/hg19/iGenomes/Sequence/Chromosomes/chr16.fa',
 '/mnt/work1/data/genomes/human/hg19/iGenomes/Sequence/Chromosomes/chr20.fa',
 '/mnt/work1/data/genomes/human/hg19/iGenomes/Sequence/Chromosomes/chr15.fa',
 '/mnt/work1/data/genomes/human/hg19/iGenomes/Sequence/Chromosomes/chr12.fa',
 '/mnt/work1/data/genomes/human/hg19/iGenomes/Sequence/Chromosomes/chrM.fa',
 '/mnt/work1/data/genomes/human/hg19/iGenomes/Sequence/Chromosomes/chr1.fa',
 '/mnt/work1/data/genomes/human/hg19/iGenomes/Sequence/Chromosomes/chr4.fa',
 '/mnt/work1/data/genomes/human/hg19/iGenomes/Sequence/Chromosomes/chr9.fa',
 '/mnt/work1/data/genomes/human/hg19/iGenomes/Sequence/Chromosomes/chr18.fa',
 '/mnt/work1/data/genomes/human/hg19/iGenomes/Sequence/Chromosomes/chr10.fa',
 '/mnt/work1/data/genomes/human/hg19/iGenomes/Sequence/Chromosomes/chr22.fa',
 '/mnt/work1/data/genomes/human/hg19/iGenomes/Sequence/Chromosomes/chr14.fa',
 '/mnt/work1/data/genomes/human/hg19/iGenomes/Sequence/Chromosomes/chrX.fa',
 '/mnt/work1/data/genomes/human/hg19/iGenomes/Sequence/Chromosomes/chr11.fa',
 '/mnt/work1/data/genomes/human/hg19/iGenomes/Sequence/Chromosomes/chr13.fa',
 '/mnt/work1/data/genomes/human/hg19/iGenomes/Sequence/Chromosomes/chr19.fa',
 '/mnt/work1/data/genomes/human/hg19/iGenomes/Sequence/Chromosomes/chr8.fa',
 '/mnt/work1/data/genomes/human/hg19/iGenomes/Sequence/Chromosomes/chr17.fa',
 '/mnt/work1/data/genomes/human/hg19/iGenomes/Sequence/Chromosomes/chr7.fa'])
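
One way to get the fail-fast behaviour requested here is to check the shape of the tracks argument before any archive is opened. The following is only a minimal sketch: validate_tracks is a hypothetical helper, not part of genomedata, and it assumes that tracks is meant to be a sequence of (name, path) pairs, as the issue title suggests.

```python
from collections.abc import Sequence


def validate_tracks(tracks):
    """Return tracks as a list of (name, path) pairs, or raise immediately.

    Hypothetical helper for illustration; not part of genomedata.
    Assumes each track should be a two-item (name, path) sequence.
    """
    if tracks is None:
        return []
    if isinstance(tracks, (str, bytes)):
        # A single bare string would otherwise be iterated character by character.
        raise TypeError(
            f"tracks must be a sequence of (name, path) pairs, got {tracks!r}"
        )

    checked = []
    for index, track in enumerate(tracks):
        is_pair = (
            isinstance(track, Sequence)
            and not isinstance(track, (str, bytes))
            and len(track) == 2
        )
        if not is_pair:
            raise ValueError(
                f"tracks[{index}] must be a (name, path) pair, got {track!r}"
            )
        name, path = track
        checked.append((str(name), str(path)))
    return checked
```

With such a check in place, the invocation above would raise a ValueError naming tracks[0] before any sequence file is read.
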
EricR86 commented 7 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


Notably, this example manages to hang indefinitely.

EricR86 commented 7 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


Upon some investigation, there could be a hidden underlying problem. The fact that the tracks option is not a tuple does not, by itself, clearly explain the entire problem. Notably, there is this section of the load_seq code, which runs before the tracks are parsed. From genomedata/_load_seq.py:243:

    warnings.simplefilter("ignore")
    with Genome(gdpath, mode="w", filters=FILTERS_GZIP) as genome:
        if seqfile_type == "sizes":
            for name, size in sizes.items():
                chromosome = create_chromosome(genome, name, mode)
                size_chromosome(chromosome, size)
        else:
            assert seqfile_type in frozenset(["agp", "fasta"])
            for filename in filenames:
                if verbose:
                    print(filename, file=sys.stderr)

                with maybe_gzip_open(filename) as infile:
                    if seqfile_type == "agp":
                        name = path(filename).name.rpartition(".agp")[0]
                        chromosome = create_chromosome(genome, name, mode)
                        read_assembly(chromosome, infile)
                    else:
                        for defline, seq in LightIterator(infile):
                            chromosome = create_chromosome(genome, defline, mode)
                            read_seq(chromosome, seq)
    # XXX: this should be enforced even when there is an exception
    # is there a context manager available?
    warnings.resetwarnings()
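
On the XXX comment in the excerpt: the standard library does provide such a context manager, warnings.catch_warnings(), which restores the previous filter state on exit even when an exception is raised. The sketch below shows only the pattern. run_quietly and noisy_step are hypothetical stand-ins, not genomedata code.

```python
import warnings


def noisy_step():
    """Hypothetical stand-in for a call that emits warnings."""
    warnings.warn("example warning", UserWarning)
    return "done"


def run_quietly(func):
    """Call func with warnings ignored, restoring the filters afterwards."""
    with warnings.catch_warnings():
        # This filter applies only inside the with block.
        warnings.simplefilter("ignore")
        return func()
```

Unlike warnings.resetwarnings(), which clears all filters, this leaves any filters the caller had already configured in place once the block exits.
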
EricR86 commented 7 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


@cviner Is 5xC-sorted.bedGraph.gz available, or is there a smaller bedGraph that produces similar results?

EricR86 commented 7 years ago

Original comment by Coby Viner (Bitbucket: cviner2, GitHub: cviner).


It is not public data and is ~4 MiB. It is difficult for me to see how it could depend on that particular bedGraph (does any simple bedGraph not reproduce this?). I can give you a copy of it for local testing if necessary, though.

I don't know if this occurs for others, as I only made this mistake the one time, in an interactive session.
