linnarsson-lab / loompy

Python implementation of the Loom file format - http://loompy.org
BSD 2-Clause "Simplified" License
140 stars 37 forks source link

problem with naming of fragments with mouse index build #143

Closed BenvanderVeer closed 3 years ago

BenvanderVeer commented 3 years ago

Hi, I made an index using the recent posted solution for the mouse index (with Kallisto), but when try to run loom fromfq on my fastq files, it get this error:

2020-11-24 18:53:33,384 - INFO - Using 25 threads.
2020-11-24 18:53:33,386 - INFO - kallisto bus -i /staging/leuven/stg_00072/reference/mm10/gencode.M23.GRCm38.p6/transcripts/kallisto_test/gencode.vM23.fragments.idx -o /local_scratch/tmp-vsc32808/tmpxem7z4gn -x 10xv3 -t 25 /staging/leuven/stg_00072/Ben/antNPCs/10xGenomics_F1Tm1/WT_day3/fastqFiles/200903.NovaSeq1.FCB/GC1002925_H6_S10_L001_R1_001.fastq.gz /staging/leuven/stg_00072/Ben/antNPCs/10xGenomics_F1Tm1/WT_day3/fastqFiles/200903.NovaSeq1.FCB/GC1002925_H6_S10_L001_R2_001.fastq.gz /staging/leuven/stg_00072/Ben/antNPCs/10xGenomics_F1Tm1/WT_day3/fastqFiles/200903.NovaSeq1.FCB/GC1002925_H6_S10_L002_R1_001.fastq.gz /staging/leuven/stg_00072/Ben/antNPCs/10xGenomics_F1Tm1/WT_day3/fastqFiles/200903.NovaSeq1.FCB/GC1002925_H6_S10_L002_R2_001.fastq.gz /staging/leuven/stg_00072/Ben/antNPCs/10xGenomics_F1Tm1/WT_day3/fastqFiles/200923.NovaSeq1.FCA/GC1002925_H6_S10_L001_R1_001.fastq.gz /staging/leuven/stg_00072/Ben/antNPCs/10xGenomics_F1Tm1/WT_day3/fastqFiles/200923.NovaSeq1.FCA/GC1002925_H6_S10_L001_R2_001.fastq.gz /staging/leuven/stg_00072/Ben/antNPCs/10xGenomics_F1Tm1/WT_day3/fastqFiles/200923.NovaSeq1.FCA/GC1002925_H6_S10_L002_R1_001.fastq.gz /staging/leuven/stg_00072/Ben/antNPCs/10xGenomics_F1Tm1/WT_day3/fastqFiles/200923.NovaSeq1.FCA/GC1002925_H6_S10_L002_R2_001.fastq.gz
2020-11-24 18:53:33,468 - INFO - [index] k-mer length: 31
2020-11-24 18:53:33,468 - INFO - [index] number of targets: 374,274
2020-11-24 18:53:33,468 - INFO - [index] number of k-mers: 142,414,483
2020-11-24 18:53:47,927 - INFO - [index] number of equivalence classes: 1,093,834
2020-11-24 18:53:51,330 - INFO - [quant] will process sample 1: /staging/leuven/stg_00072/Ben/antNPCs/10xGenomics_F1Tm1/WT_day3/fastqFiles/200903.NovaSeq1.FCB/GC1002925_H6_S10_L001_R1_001.fastq.gz
2020-11-24 18:53:51,330 - INFO -                                /staging/leuven/stg_00072/Ben/antNPCs/10xGenomics_F1Tm1/WT_day3/fastqFiles/200903.NovaSeq1.FCB/GC1002925_H6_S10_L001_R2_001.fastq.gz
2020-11-24 18:53:51,330 - INFO - [quant] will process sample 2: /staging/leuven/stg_00072/Ben/antNPCs/10xGenomics_F1Tm1/WT_day3/fastqFiles/200903.NovaSeq1.FCB/GC1002925_H6_S10_L002_R1_001.fastq.gz
2020-11-24 18:53:51,330 - INFO -                                /staging/leuven/stg_00072/Ben/antNPCs/10xGenomics_F1Tm1/WT_day3/fastqFiles/200903.NovaSeq1.FCB/GC1002925_H6_S10_L002_R2_001.fastq.gz
2020-11-24 18:53:51,330 - INFO - [quant] will process sample 3: /staging/leuven/stg_00072/Ben/antNPCs/10xGenomics_F1Tm1/WT_day3/fastqFiles/200923.NovaSeq1.FCA/GC1002925_H6_S10_L001_R1_001.fastq.gz
2020-11-24 18:53:51,330 - INFO -                                /staging/leuven/stg_00072/Ben/antNPCs/10xGenomics_F1Tm1/WT_day3/fastqFiles/200923.NovaSeq1.FCA/GC1002925_H6_S10_L001_R2_001.fastq.gz
2020-11-24 18:53:51,330 - INFO - [quant] will process sample 4: /staging/leuven/stg_00072/Ben/antNPCs/10xGenomics_F1Tm1/WT_day3/fastqFiles/200923.NovaSeq1.FCA/GC1002925_H6_S10_L002_R1_001.fastq.gz
2020-11-24 18:53:51,330 - INFO -                                /staging/leuven/stg_00072/Ben/antNPCs/10xGenomics_F1Tm1/WT_day3/fastqFiles/200923.NovaSeq1.FCA/GC1002925_H6_S10_L002_R2_001.fastq.gz
2020-11-24 19:04:53,543 - INFO - [quant] finding pseudoalignments for the reads ... done
2020-11-24 19:04:53,543 - INFO - [quant] processed 364,969,769 reads, 259,218,367 reads pseudoaligned
2020-11-24 19:04:58,344 - INFO - Loading gene metadata
2020-11-24 19:04:58,691 - INFO - Loading fragments-to-gene mappings
2020-11-24 19:04:58,947 - INFO - Indexing genes
Traceback (most recent call last):
  File "/data/leuven/328/vsc32808/miniconda/envs/loompy/bin/loompy", line 10, in <module>
    sys.exit(cli())
  File "/data/leuven/328/vsc32808/miniconda/envs/loompy/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/data/leuven/328/vsc32808/miniconda/envs/loompy/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/data/leuven/328/vsc32808/miniconda/envs/loompy/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/leuven/328/vsc32808/miniconda/envs/loompy/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/data/leuven/328/vsc32808/miniconda/envs/loompy/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/data/leuven/328/vsc32808/miniconda/envs/loompy/lib/python3.8/site-packages/loompy/commands.py", line 34, in fromfq
    create_from_fastq(loomfile, sampleid, list(fastqs), indexdir, metadatafile, threads)
  File "/data/leuven/328/vsc32808/miniconda/envs/loompy/lib/python3.8/site-packages/loompy/bus_file.py", line 429, in create_from_fastq
    bus = BusFile(
  File "/data/leuven/328/vsc32808/miniconda/envs/loompy/lib/python3.8/site-packages/loompy/bus_file.py", line 186, in __init__
    self.gene_for_fragment_idx[i] = accession_idx[self.gene_for_fragment[i]]
KeyError: 'ENSMUSG00000079800.2::GL456210.1:9123-58882'

I think it has to do with the naming of the genes/fragments, but I do not understand enough of the code to be able to fix it.

best, Ben

BenvanderVeer commented 3 years ago

Any update on this? Because I tried everything - using different genome versions etc, but it keeps giving me the same error, but i don't understand where the error is coming from?

For human data it works fine, but for the mouse index it doesn't work?

best, Ben

lsundaram commented 3 years ago

I face the same issue as well. Please let me know if you found a fix.

lsundaram commented 3 years ago

I could fix the issue by fixing the gene names in the fragements2genes.txt, unspliced_fragments.txt and the fragments.fa used to build the kallisto index. For some reason the gene names are appended with their coordinates which does not have a corresponding match with the gene metadata file. I think this is because of some string parsing error in the code used in generating these files. For now i fixed the files post generation with pandas and sed and it works all the way. Hope this helps

JBaiwir commented 3 years ago

Hello.

I found what causes the issue. I followed the pipeline here: https://github.com/linnarsson-lab/loompy/tree/master/kallisto And I faced the same type of error.

After some research, I found that an update in bedtools change its behavior. I fixed it based on this: https://github.com/arq5x/bedtools2/issues/805

If you followed the same pipeline (which is really great by the way), I suggest you to change the line 23 of the "mouse_download.sh" code as follow:

bedtools getfasta -name -fo gencode.vM23.unspliced.fa -fi GRCm38.primary_assembly.genome.fa -bed gencode.vM23.primary_assembly.annotation.sorted.bed

Becomes

bedtools getfasta -name -fi GRCm38.primary_assembly.genome.fa -bed gencode.vM23.primary_assembly.annotation.sorted.bed | sed 's/::.*//' > gencode.vM23.unspliced.fa

I just did this and rerun the pipeline and everything did as expected, and I could have a .loom file at the end.

I hope this helps. This is my first message on github, sorry if I did not followed some guidelines that I am unaware of. I was stuck with this same problem and I would be happy if someone already posted this answer.

I wish you all the best in your analyses.

Jérôme

BenvanderVeer commented 3 years ago

Changing the line works for me indeed! Thank you for your help!

best, Ben