hitaandrea / MGcount

MGcount github repository
GNU General Public License v3.0
14 stars 9 forks source link

MGCount fails to run #8

Open Rohit-Satyam opened 1 year ago

Rohit-Satyam commented 1 year ago

Hi

I was trying MGCount on Plasmodium Falciparum BAM files obtained from cellranger (10X 3' chemistry) and I get the following error. Can you suggest what's wrong here?

Python 3.11.5 MGcount RNA-seq quantification pipeline v1.1.0

python3 -m mgcount -T 10 --gtf PlasmoDB-64_Pfalciparum3D7.gtf --outdir mgcount_outputs --bam_infiles bams.txt
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/subudhak/miniconda3/envs/mgcount/lib/python3.11/site-packages/mgcount/__main__.py", line 331, in <module>
    main()
  File "/home/subudhak/miniconda3/envs/mgcount/lib/python3.11/site-packages/mgcount/__main__.py", line 224, in main
    gtf = read_gtf(gtf_filename)
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/subudhak/miniconda3/envs/mgcount/lib/python3.11/site-packages/gtfparse/read_gtf.py", line 208, in read_gtf
    result_df = parse_gtf_and_expand_attributes(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/subudhak/miniconda3/envs/mgcount/lib/python3.11/site-packages/gtfparse/read_gtf.py", line 151, in parse_gtf_and_expand_attributes
    result = parse_gtf(
             ^^^^^^^^^^
  File "/home/subudhak/miniconda3/envs/mgcount/lib/python3.11/site-packages/gtfparse/read_gtf.py", line 82, in parse_gtf
    chunk_iterator = pd.read_csv(
Rohit-Satyam commented 1 year ago

The MGCount binary file works but it fails to detect subread installed in the conda environment. I had to install subread using sudo apt-get install subread. Can you see why the binary is unable to locate subread in the conda environment?

NFO:root:Extracted GTF attributes: ['gene_id', 'ID', 'description', 'ebi_biotype', 'original_biotype', 'transcript_id', 'Parent', 'gene_ebi_biotype', 'Name', 'Note', 'protein_source_id', '7.1.1.9', '2.1.1.56', '3.5.1.89', '2.1.1.228', '2.1.1.64', '4.2.99.18', '1.3.1.89', '2.4.99.18', '6.2.1.3', '2.3.2.23', '6.3.1.20', '3.1.1.4', '2.7.11.17', '6.3.5.1', '2.3.1.225', '6.3.2.-', '5.4.2.2', '4.2.1.17', '2.1.1.282', '2.5.1.114', '2.1.1.12', '2.3.1.1', '2.3.1.48', '2.3.1.12', '1.17.7.3', '6.3.4.15', '6.4.1.2', '4.1.1.50', '5.4.99.27', '3.1.1.5', '7.1.2.2', '2.7.7.63', '2.7.1.151', '2.1.1.32', '3.1.4.17', '3.1.4.35', '3.6.3.6', '2.3.1.257', '6.3.2.17', '1.1.1.37', '3.5.1.98', '4.2.1.134', '2.1.1.220', '3.1.16.1', '2.7.11.1', '4.2.1.33', '2.1.1.141', '5.99.1.2', 'involved', '3.4.24.55', '2.1.1.320', '3.1.3.32', '4.1.3.27', '2.8.1.-', '6.2.1.45', '2.5.1.60', '6.3.2.19', '2.7.8.41', '2.2.1.7', '2.7.7.72', '2.8.4.3', '2.1.1.193', '2.7.1.1', '3.6.4.12', '1.8.4.1', '2.7.8.2', '2.3.1.4', '3.5.4.9', '2.5.1.32', '2.5.1.90', '3.1.11.2', '2.1.1.201', '1.5.1.9', '2.3.1.168', '1.8.4.2', '2.7.4.6', '3.1.3.48', '2.7.1.67', '2.7.11.26', '7.6.2.1', '2.7.8.15', '3.4.21.26', '3.6.4.13', '3.1.3.99', '4.2.1.75', '2.4.1.83', '3.6.3.8', '1.11.1.9', '1.3.99.1', '1.1.1.94', '5.99.1.3', '7.1.3.1', '2.1.1.33', '2.1.1.43', '2.1.1.182', '2.1.1.62', '5.3.1.8', '2.3.1.88', '2.4.1.227', '2.7.6.3', '6.2.1.64', '2.5.1.31', '3.4.21.105', '2.7.1.150', '7.2.2.-', '5.4.99.12', '2.7.4.7', '3.6.1.11', '5.4.99.25', '3.1.3.2', '6.3.2.22', '2.1.1.183', '2.1.1.319', '3.1.1.32', '2.7.4.24', '3.6.5.1', '2.7.7.6', '7.1.1.8', '2.1.1.233', '2.3.1.231', '3.4.24.56', '4.2.1.24', '1.6.1.1', '1.6.1.2', '7.1.1.1', '3.1.1.31', '3.5.99.6', '4.6.1.16', '5.1.99.6', '6.3.4.14', '1.17.7.4', '3.1.11.1', '2.7.7.41', '2.7.4.3', '3.1.4.1', '6.2.1.5', '3.5.1.2', '4.3.3.6', '2.1.1.31', '1.14.11.27', '2.5.1.10', '2.5.1.29', '5.4.2.3', '1.18.1.6', '4.2.1.93', '3.4.19.12', '3.4.14.4', '2.6.99.4', '2.1.1.45', '2.1.1.29', '2.7.11.2', '3.1.1.23', '3.6.1.9']
--------------------------------------------------------
Assigning alignments to features (1/3)
--------------------------------------------------------

featureCounts not found. Please, check the program is available on the path specified by --featureCounts_path argument (default: /user/bin/featureCounts)

MGCount then runs for a while and finally fails and exit with a message:

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "multiprocessing/pool.py", line 119, in worker
  File "multiprocessing/pool.py", line 44, in mapstar
  File "mgcount/hierarchassign.py", line 107, in assign_rna_reads_hierarchically
FileNotFoundError: [Errno 2] No such file or directory: 
'/mgcount_outputs/.mg_2k3cvben/possorted_genome_fids.txt'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "MGcount.py", line 7, in <module>
  File "mgcount/__main__.py", line 293, in main
  File "mgcount/hierarchassign.py", line 61, in read_assignation_loop
  File "multiprocessing/pool.py", line 266, in map
  File "multiprocessing/pool.py", line 644, in get
FileNotFoundError: [Errno 2] No such file or directory: 
'/mgcount_outputs/.mg_2k3cvben/possorted_genome_fids.txt'
[729216] Failed to execute script MGcount
Rohit-Satyam commented 9 months ago

hi @hitaandrea @mgcount-github Can you respond to this?

hitaandrea commented 9 months ago

Hi Rohit,

I am sorry for my late reply. I wanted to check and try to reproduce the conda issue and did not had a moment for this yet. As for the second error, this usually happens when the gtf attributes are not named as MGcount expects.

In your specific case, I see your gtf contains at least the following columns: 'gene_id', 'ID', 'description', 'ebi_biotype', 'original_biotype', 'transcript_id', 'Parent', 'gene_ebi_biotype', 'Name', 'Note', 'protein_source_id'. By default, MGcount tries to extract "gene_name" and "gene_biotype" for long transcripts and "transcript_name" and "transcript_biotype" for small transcripts. I believe this could be causing the error.

If this is the case, you could either reformat your gtf with the default column names expected by MGcount (see --help) or specify the columns of interest with MGcount arguments (e.g. --feature_output_long gene_id --feature_output_small gene_id --feature_biotype_long gene_ebi_biotype --feature_biotype_small gene_ebi_biotype --feature_small gene if you want to use gene-expression as output for both small RNAs and long RNAs). Additionally, MGcount would expect that your gtf contains unde the type column entries specified as "gene" (and optionally "exon" and "trascript" for long RNAs which are used to output exonic/intronic counts). If you share with me the gtf, I could take a look at it. For the future version, I plan to add some warnings to MGcount in order to indicate what may be missing in the gtf to orient the user. I hope this helps.. let me know!

loganminhdang commented 8 months ago

@Rohit-Satyam @hitaandrea I wonder if you have any updates. I encounter the same error even with the MGcount-provided mouse gtf file, running on 10X Chromium single-cell bam files. It's quite challenging to debug what has gone wrong. The step that has failed is when MGcount tries to extract the multi-loci group. The error message is below. Any update or guidance is deeply appreciated. One thing worth noting is that I have run MGcount successfully on bam files resulting from bulk RNA-seq experiments, and wonder if the 10X bam file format is the culprit behind the algorithm failing.

Traceback (most recent call last): File "/home/qdang/micromamba/envs/MGcount/lib/python3.10/runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File "/home/qdang/micromamba/envs/MGcount/lib/python3.10/runpy.py", line 86, in _run_code exec(code, run_globals) File "/home/qdang/micromamba/envs/MGcount/lib/python3.10/site-packages/mgcount/main.py", line 331, in main() File "/home/qdang/micromamba/envs/MGcount/lib/python3.10/site-packages/mgcount/main.py", line 296, in main ml = mg.MG(infiles, outdir, tmppath, gtf, crounds, btype_crounds, n_cores, File "/home/qdang/micromamba/envs/MGcount/lib/python3.10/site-packages/mgcount/multigraph.py", line 47, in MG counts = np.zeros(pd.read_table( File "/home/qdang/micromamba/envs/MGcount/lib/python3.10/site-packages/pandas/util/_decorators.py", line 311, in wrapper return func(*args, kwargs) File "/home/qdang/micromamba/envs/MGcount/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 779, in read_table return _read(filepath_or_buffer, kwds) File "/home/qdang/micromamba/envs/MGcount/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 575, in _read parser = TextFileReader(filepath_or_buffer, kwds) File "/home/qdang/micromamba/envs/MGcount/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 933, in init self._engine = self._make_engine(f, self.engine) File "/home/qdang/micromamba/envs/MGcount/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1217, in _make_engine self.handles = get_handle( # type: ignore[call-overload] File "/home/qdang/micromamba/envs/MGcount/lib/python3.10/site-packages/pandas/io/common.py", line 789, in get_handle handle = open( FileNotFoundError: [Errno 2] No such file or directory: '/mnt/sdb/qdang/singlecell/MGcount/outputs/.mg_napnf3qy/e175_possorted_genome_counts_small.csv'.

hitaandrea commented 7 months ago

Hello loganminhdang,

The problem here does not come from the gtf as in Rohit-Satyam case. From the error messages, it looks like counts file is missing for small-RNAs assignation round so the graph cannot be built afterwards. We have processed 10x dataset with MGcount successfully in the past using STAR as the aligner. However, as you said, there might be any particularity in your input alignments that unfortunately, is giving issues to MGcount. If you send me a .bam subsample of "e175_possorted_genome" to my email address I could try to help exploring this further. Please, let me know.

loganminhdang commented 7 months ago

Dear Hita,

Thank you for your response! This tool is wonderful to use, and it bugs me that this run wasn't successful, as MGcount helped me in my bulk RNA-seq analysis before.

My 10X data was also passed through the CellRanger pipeline with STAR as the aligner. However, I'm wondering whether it may be the case that our 10X transcriptome simply wasn't able to capture small RNAs due to their lack of polyA tails: https://kb.10xgenomics.com/hc/en-us/articles/217261666-What-types-of-RNA-can-be-detected-/, hence why small RNAs cannot be assigned.

Could you let me know how your 10X data was generated (did you use the standard 3' workflow, and did you perform any additional steps to better capture small RNAs)? That should help me determine if this is simply a constraint imposed by our data.

Thank you so much and all the best!

anna-alemany commented 7 months ago

Dear loganminhdang,

I jump in the conversation, since I work a lot with hita.

In the 10x that hita refers to, we used the standard 3' workflow. We first used the output from cellranger to filter out low quality cell barcodes and doublets, and then went back to fastq files to generate individual files for each cell. We then map this ourselves using STAR and the mgcount-specific gtf files. I've never work with the cellranger-produced bamfiles, maybe there is some issue there?

Best, Anna

selouej commented 1 month ago

Dear all, I jump to the conversation to give my feedback on this issue as I tried to launch mgcount. I faced the same issue like reported by Rohit. First, in the example provided to test mgcount in the command line, we have to specify the right GTF file: --gtf annotations_gtf/Homo_sapiens.GRCh38.custom.gtf Second in the "/path/to/gtfparse/read_gtf.py" one solution is to replace warn_bad_lines=False and error_bad_lines=False by on_bad_lines='skip' (I think it's related to pandas version) And finally, after subread installation make sure to add it to your PATH, if it's not the case.

It will be appreciated to have your feedbacks. Thank you!

Best, Sahar