hitaandrea / MGcount

MGcount github repository
GNU General Public License v3.0
14 stars 9 forks source link

MGcounts throws error with a synthetic GTF #7

Open zslastman opened 1 year ago

zslastman commented 1 year ago

I want to use MGcounts's ability to find communities of multimapping features, on some sequencing data of synthetic constructs. I have made a fasta, aligned to it, and made a fake gtf file with all of the columns from your example gtf, when running MGcounts I get this error:


Assigning alignments to features (1/3)
--------------------------------------------------------
--> long_exon assignation round for: sub.sorted
--> long_intron assignation round for: sub.sorted
--------------------------------------------------------
Extracting multi-loci groups (2/3)
--------------------------------------------------------
Traceback (most recent call last):
  File "/home/dharnett/miniconda3/envs/mgcounts/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/dharnett/miniconda3/envs/mgcounts/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/dharnett/miniconda3/envs/mgcounts/lib/python3.9/site-packages/mgcount/__main__.py", line 331, in <module>
    main()
  File "/home/dharnett/miniconda3/envs/mgcounts/lib/python3.9/site-packages/mgcount/__main__.py", line 296, in main
    ml = mg.MG(infiles, outdir, tmppath, gtf, crounds, btype_crounds, n_cores,
  File "/home/dharnett/miniconda3/envs/mgcounts/lib/python3.9/site-packages/mgcount/multigraph.py", line 47, in MG
    counts = np.zeros(pd.read_table(
  File "/home/dharnett/miniconda3/envs/mgcounts/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/dharnett/miniconda3/envs/mgcounts/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 683, in read_table
    return _read(filepath_or_buffer, kwds)
  File "/home/dharnett/miniconda3/envs/mgcounts/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 482, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/dharnett/miniconda3/envs/mgcounts/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 811, in __init__
    self._engine = self._make_engine(self.engine)
  File "/home/dharnett/miniconda3/envs/mgcounts/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1040, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
  File "/home/dharnett/miniconda3/envs/mgcounts/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 51, in __init__
    self._open_handles(src, kwds)
  File "/home/dharnett/miniconda3/envs/mgcounts/lib/python3.9/site-packages/pandas/io/parsers/base_parser.py", line 222, in _open_handles
    self.handles = get_handle(
  File "/home/dharnett/miniconda3/envs/mgcounts/lib/python3.9/site-packages/pandas/io/common.py", line 701, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/mydata/dharnett/mgcounts_output/.mg_8sdsmcnk/sub.sorted_counts_long_exon.csv'```

I am creating my fake gtf like so in R:
```#make a gtf file with annotations for MGcout
amplicon_gr = GRanges(c('backbone',insert_nms),IRanges(1,c(nchar(backboneseq),nchar(amplicons))),strand='+')
for(cnm in gtfcols) mcols(amplicon_gr)[[cnm]] = 'foo'
amplicon_gr$type = 'transcript'
amplicon_gr$score = 5
amplicon_gr$phase = 0
amplicon_gr$gene_id = names(backbone_amp_seqs)
amplicon_gr$gene_name = names(backbone_amp_seqs)
amplicon_gr$gene_biotype = 'protein_coding'
amplicon_gr$transcript_id = names(backbone_amp_seqs)
amplicon_gr$transcript_name = names(backbone_amp_seqs)
amplicon_gr$transcript_biotype = 'protein_coding'
amplicon_gr$exon = names(backbone_amp_seqs)
# amplicon_gr$gene_id = names(backbone_amp_seqs)
amplicon_gtf = file.path(amplicon_seqfile%>%str_replace('.fa','.gtf'))
rtracklayer::export(c(amplicon_gr),amplicon_gtf)```

The error tells me that MGcounts doesn't like all of my genes getting sorted into the 'small' category - so i need to adjust my fake gtf so that it also contains some 'long' exons. How can I go about doing this? What is the criterion by which things get put into the 'long_exon' category?
hitaandrea commented 1 year ago

Hi,

The criteria by which different annotations are used for the different rounds is the file in data/btypes_crounds.csv which contains a two column tables listing which biotypes are included either in the small or long RNA rounds.

The protein_coding case is included in the list so it should be taken into account in the long round without further configuration of the btypes_crounds.csv. However, the long round assumes the input gtf have gene, transcript, exon annotation types. By default it uses annotations under type=exon to assign reads to exons and then under type=gene to assign remaining reads to introns. However, if I understand correctly, in your case you are setting the annotation to be of type "transcript" so this is probably never counted while the software looks for exon and gene, and then, the resultant assignations file is empty and the communities step fails. It would be more advisable to encode the synthetic annotation regions to be counted in the small round (does not distinguish exons or introns). For this, you could set amplicon_gr$type = 'transcript' and amplicon_gr$transcript_biotype = 'sRNA' instead of protein_coding.

As this is a custom way of employing MGcount, if you want to provide me with a region example and a minimum .bam file example, I could also test it and validate this configuration for you. Let me know.