Open zslastman opened 1 year ago
Hi,
The criteria that determine which annotations are used in each round are defined in the file data/btypes_crounds.csv, a two-column table listing which biotypes are included in the small-RNA round and which in the long-RNA round.
The protein_coding biotype is included in that list, so it should be taken into account in the long round without further changes to btypes_crounds.csv. However, the long round assumes the input GTF has gene, transcript, and exon annotation types. By default it uses annotations of type=exon to assign reads to exons, and then annotations of type=gene to assign the remaining reads to introns. If I understand correctly, in your case you are setting the annotation type to "transcript", so these features are probably never counted because the software looks only for exon and gene. The resulting assignments file is then empty and the communities step fails. It would be more advisable to encode the synthetic annotation regions so that they are counted in the small round, which does not distinguish exons from introns. For this, you could set amplicon_gr$type = 'transcript' and amplicon_gr$transcript_biotype = 'sRNA' instead of protein_coding.
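If you would rather patch the GTF file directly than edit the GRanges object, here is a minimal Python sketch of the same change. It is not part of MGcount: the helper names `feature_type_counts` and `to_small_round` and the example record are made up for illustration, and it assumes a standard 9-column, tab-separated GTF. It is also worth checking that 'sRNA' is listed for the small round in your copy of btypes_crounds.csv.

```python
import re
from collections import Counter


def feature_type_counts(gtf_lines):
    """Count the values in GTF column 3 (feature type), skipping comments and blank lines."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        counts[line.split("\t")[2]] += 1
    return counts


def to_small_round(gtf_line, biotype="sRNA"):
    """Return the record with type 'transcript' and transcript_biotype set to `biotype`."""
    fields = gtf_line.rstrip("\n").split("\t")
    fields[2] = "transcript"
    new_attr = f'transcript_biotype "{biotype}";'
    # Replace an existing transcript_biotype attribute, or append one if it is missing.
    attrs, n_replaced = re.subn(r'\btranscript_biotype "[^"]*";', new_attr, fields[8])
    if n_replaced == 0:
        attrs = fields[8].rstrip() + " " + new_attr
    fields[8] = attrs
    return "\t".join(fields)


# Made-up record, for illustration only.
example = (
    "amplicon1\tsynthetic\ttranscript\t1\t150\t.\t+\t.\t"
    'gene_id "amp1"; transcript_id "amp1.t"; transcript_biotype "protein_coding";'
)

print(feature_type_counts([example]))  # shows which feature types the file contains
print(to_small_round(example))
```

Running `feature_type_counts` over the whole file first is a quick way to confirm whether any exon or gene records exist at all, which is what the long round looks for.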
As this is a custom way of using MGcount, if you can provide me with an example region and a minimal example .bam file, I could also test and validate this configuration for you. Let me know.
I want to use MGcount's ability to find communities of multimapping features on some sequencing data from synthetic constructs. I have made a FASTA, aligned to it, and made a fake GTF file with all of the columns from your example GTF. When running MGcount I get this error: