ngless-toolkit / ngless

NGLess: NGS with less work
https://ngless.embl.de

GFF subfeature counting should be expanded #64

Closed unode closed 6 years ago

unode commented 6 years ago

The GFF3 format specification allows an attribute to have multiple values, separated by commas. From the official docs:

Parent=AF2312,AB2812,abc-3
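
To make the multi-value rule concrete, here is a minimal Python sketch of how the attribute column could be split into lists of values. It is only an illustration, not NGLess's parser; the function name `parse_gff3_attributes` is hypothetical. It assumes the GFF3 convention that literal commas inside a value are percent-encoded (`%2C`), so splitting happens before decoding.

```python
from typing import Dict, List
from urllib.parse import unquote


def parse_gff3_attributes(column: str) -> Dict[str, List[str]]:
    """Parse the ninth GFF3 column into {tag: [value, ...]} (hypothetical helper)."""
    attributes: Dict[str, List[str]] = {}
    for field in column.strip().split(';'):
        if not field:
            continue  # tolerate a trailing ';'
        tag, _, raw = field.partition('=')
        # Split on literal commas first, then decode percent-escapes,
        # so an escaped comma (%2C) stays inside a single value.
        attributes[unquote(tag)] = [unquote(value) for value in raw.split(',')]
    return attributes


print(parse_gff3_attributes('Parent=AF2312,AB2812,abc-3'))
# {'Parent': ['AF2312', 'AB2812', 'abc-3']}
```

Single-valued attributes come back as one-element lists, so downstream code can treat every attribute the same way.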

With the current version of NGLess a GFF file:

    ##gff-version 3
    reference   protein_coding  gene    40  100 .   +   .   gene_id=geneA;gene_name=featA1,featA2
    reference   protein_coding  gene    110 130 .   +   .   gene_id=geneB;gene_name=featA1
    reference   protein_coding  gene    140 200 .   +   .   gene_id=geneC;gene_name=featA2

and a script:

    ngless "0.7"

    input = fastq('reads.fq.gz')
    mapped = map(input, fafile='ref.fna.gz')

    union = count(mapped,
                  gff_file='features.gff',
                  features=['gene'],
                  subfeatures=['gene_name'],
                  mode={union})
    write(union, ofile='output.txt')

produces:

                     reads.fq.gz
    -1               4
    featA1           0
    featA1,featA2    4
    featA2           1

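However, the expectation is that comma-separated values are expanded, so that the 4 reads assigned to featA1,featA2 count toward both featA1 and featA2. The expansion should also take into account the mode= and multiple= arguments of count(). Expected output: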

                     reads.fq.gz
    -1               4
    featA1           4
    featA2           5
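
The arithmetic behind the expected table can be sketched in a few lines of Python. This is an illustration only, not the NGLess implementation: the helper `expand_counts` is hypothetical, it works on the already-tabulated counts rather than on alignments, and it simply credits each read in full to every value of a comma-separated label, ignoring the mode= and multiple= handling mentioned above.

```python
from collections import Counter


def expand_counts(raw_counts):
    """Credit the count of each comma-separated label to every value it names."""
    expanded = Counter()
    for label, n in raw_counts.items():
        for name in label.split(','):
            expanded[name] += n
    return dict(expanded)


# Counts from the unexpanded output above.
raw = {'-1': 4, 'featA1': 0, 'featA1,featA2': 4, 'featA2': 1}
print(expand_counts(raw))
# {'-1': 4, 'featA1': 4, 'featA2': 5}
```

featA1 ends at 0 + 4 = 4 and featA2 at 1 + 4 = 5, matching the expected table.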

unode commented 6 years ago

This was fixed in commit 1dda615716d6e3748b07b52d0a63d30bc497b359.