vanheeringen-lab / ANANSE

Prediction of key transcription factors in cell fate determination using enhancer networks. See full ANANSE documentation for detailed installation instructions and usage examples.
http://anansepy.readthedocs.io
MIT License
77 stars 16 forks source link

when loading H3K27ac data,ValueError: Length of values **does not match length of index ** #153

Closed xjtutgb closed 2 years ago

xjtutgb commented 2 years ago

image I used "ananse binding" as the following command:ananse binding -A 1.rep1.bam 1.rep2.bam -H 2.rep1.bam 2.rep2.bam -o ananse/ -r 1.rep1.narrowPeak 1.rep2.narrowPeak -g Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa -n 8."1.rep1.bam 1.rep2.bam" are two files of atac-seq files;"2.rep1.bam 2.rep2.bam" are two files of H3K27ac CUT&tag seq files.I encounted this error,i also tried assigne 1.rep1.bam 1.rep2.bam to -H。it also encountered the same error.

siebrenf commented 2 years ago

the error is happening in these few lines:

tmp = pd.DataFrame(index=self.regions)
for bam in bams:
    result = load_heatmap_data(...)
    tmp[result[0]] = result[2].T[0]

I think this means your bam files (-H 2.rep1.bam 2.rep2.bam) and your region files (-r 1.rep1.narrowPeak 1.rep2.narrowPeak) have some kind of unexpected mismatch... Are you using ANANSE 0.3.0? this version should remove the header from regionfiles. If you are using an older version, that could explain things.

apposada commented 2 years ago

Hi Maarten, Simon,

I think we found a similar problem (I was about to open a new issue as well). We are using ananse v0+untagged.1.g1a41f37.dirty .

A colleague and I are running ananse for a non-model organism species. We sucessfully generated a motif2factors custom pfm, however we are stuck at the onset of ananse binding. Please see the log attached below:

(ananse_pip_venv) [aperpos@nodos DMSO]$ ananse binding -A ATAC/ATAC_ALN_SAMPLE-1.bam ATAC/ATAC_ALN_SAMPLE-4.bam -g ~/PATH_TO_GENOME/genome/genome.fa -p ~/
PATH_TO_GENOME/genome/20211110_motif2factors/MOTIF2FACTORS.gimme.vertebrate.v5.0.pfm -o 20211115_DMSO_binding -r ~/PATH_TO_GENOME/genome/idr_ConsPeaks.bed -n 12
2021-11-16 11:05:27 | DEBUG | Motifs: /home/user/PATH_TO_GENOME/genome/20211110_motif2factors/MOTIF2FACTORS.gimme.vertebrate.v5.0.pfm
2021-11-16 11:05:42 | INFO | using motifs for 397 factors
2021-11-16 11:05:42 | DEBUG | reading default file
2021-11-16 11:06:07 | INFO | Scanning regions for motifs.
2021-11-16 11:06:07,501 - INFO - reading table
2021-11-16 11:06:18,561 - INFO - using 14000 sequences
2021-11-16 11:07:12,410 - INFO - creating score table (z-score, GC%)
2021-11-16 15:04:30,614 - INFO - done   
2021-11-16 15:04:38,078 - INFO - creating dataframe
2021-11-16 15:05:26 | INFO | loading ATAC data
2021-11-16 15:08:20 | ERROR | An error has been caught in function '<module>', process 'MainProcess' (20198), thread 'MainThread' (140611170875200):
Traceback (most recent call last):

> File "/home/user/PATH_TO_PIPENV/ananse_pip_venv/bin/ananse", line 370, in <module>
    args.func(args)
    │    │    └ Namespace(atac_bams=['ATAC/ATAC_ALN_SAMPLE-1.bam', 'ATAC/ATAC_ALN_SAMPLE-4.bam'], dist_func='peak_r...'
    │    └ <function binding at 0x7fe058144048>
    └ Namespace(atac_bams=['ATAC/ATAC_ALN_SAMPLE-1.bam', 'ATAC/ATAC_ALN_SAMPLE-4.bam'], dist_func='peak_r...'
  File "/home/user/PATH_TO_PIPENV/ananse_pip_venv/lib64/python3.6/site-packages/ananse/commands/binding.py", line 25, in binding
    ncore=args.ncore,
          │    └ 12
          └ Namespace(atac_bams=['ATAC/ATAC_ALN_SAMPLE-1.bam', 'ATAC/ATAC_ALN_SAMPLE-4.bam'], dist_func='peak_r...'
  File "/home/user/PATH_TO_PIPENV/ananse_pip_venv/lib64/python3.6/site-packages/ananse/peakpredictor.py", line 760, in predict_peaks
    ncore=ncore,
          └ 12
  File "/home/user/PATH_TO_PIPENV/ananse_pip_venv/lib64/python3.6/site-packages/ananse/peakpredictor.py", line 99, in __init__
    self.load_atac(atac_bams, update_models=False)
    │    │         └ ['/home/user/PATH_TO_GENOME/DMSO/ATAC/ATAC_ALN_SAMPLE-1.bam', '/home/user...'
    │    └ <function PeakPredictor.load_atac at 0x7fe05813aae8>
    └ <ananse.peakpredictor.PeakPredictor object at 0x7fe031b83e80>
  File "/home/user/PATH_TO_PIPENV/ananse_pip_venv/lib64/python3.6/site-packages/ananse/peakpredictor.py", line 390, in load_atac
    self._atac_data = self._load_bams(bams, title="ATAC", window=200)
    │    │            │    │          └ ['/home/user/PATH_TO_GENOME/DMSO/ATAC/ATAC_ALN_SAMPLE-1.bam', '/home/user...'
    │    │            │    └ <function PeakPredictor._load_bams at 0x7fe05813aa60>
    │    │            └ <ananse.peakpredictor.PeakPredictor object at 0x7fe031b83e80>
    │    └ None
    └ <ananse.peakpredictor.PeakPredictor object at 0x7fe031b83e80>
  File "/home/user/PATH_TO_PIPENV/ananse_pip_venv/lib64/python3.6/site-packages/ananse/peakpredictor.py", line 344, in _load_bams
    tmp[result[0]] = result[2].T[0]
    │   │            └ ('ATAC_ALN_SAMPLE-1.bam', [['Sc0000010', 4131230, 4131430, '', '+'], ['Sc0000517', 48260, 48460, '', '+'], ['Sc0...'
    │   └ ('ATAC_ALN_SAMPLE-1.bam', [['Sc0000010', 4131230, 4131430, '', '+'], ['Sc0000517', 48260, 48460, '', '+'], ['Sc0...'
    └ Empty DataFrame
      Columns: []
      Index: [Sc0000010:4131270-4131391, Sc0000517:48306-48415, Sc0000031:1882063-1882185, Sc0000161:39...
  File "/home/user/PATH_TO_PIPENV/ananse_pip_venv/lib64/python3.6/site-packages/pandas/core/frame.py", line 3044, in __setitem__
    self._set_item(key, value)
    │    │         │    └ array([122., 105., 149., ..., 134., 374., 134.])
    │    │         └ 'ATAC_ALN_SAMPLE-1.bam'
    │    └ <function DataFrame._set_item at 0x7fe06644af28>
    └ Empty DataFrame
      Columns: []
      Index: [Sc0000010:4131270-4131391, Sc0000517:48306-48415, Sc0000031:1882063-1882185, Sc0000161:39...
  File "/home/user/PATH_TO_PIPENV/ananse_pip_venv/lib64/python3.6/site-packages/pandas/core/frame.py", line 3120, in _set_item
    value = self._sanitize_column(key, value)
            │    │                │    └ array([122., 105., 149., ..., 134., 374., 134.])
            │    │                └ 'ATAC_ALN_SAMPLE-1.bam'
            │    └ <function DataFrame._sanitize_column at 0x7fe06644f488>
            └ Empty DataFrame
              Columns: []
              Index: [Sc0000010:4131270-4131391, Sc0000517:48306-48415, Sc0000031:1882063-1882185, Sc0000161:39...
  File "/home/user/PATH_TO_PIPENV/ananse_pip_venv/lib64/python3.6/site-packages/pandas/core/frame.py", line 3768, in _sanitize_column
    value = sanitize_index(value, self.index)
            │              │      │    └ <pandas._libs.properties.AxisProperty object at 0x7fe066449b38>
            │              │      └ Empty DataFrame
            │              │        Columns: []
            │              │        Index: [Sc0000010:4131270-4131391, Sc0000517:48306-48415, Sc0000031:1882063-1882185, Sc0000161:39...
            │              └ array([122., 105., 149., ..., 134., 374., 134.])
            └ <function sanitize_index at 0x7fe0664b1620>
  File "/home/user/PATH_TO_PIPENV/ananse_pip_venv/lib64/python3.6/site-packages/pandas/core/internals/construction.py", line 748, in sanitize_index
    "Length of values "
ValueError: Length of values (33907) does not match length of index (33921)

A quick look at the file with all enhancer (==ATAC) regions shows this.

(ananse_pip_venv) [aperpos@nodos genome]$ head idr_ConsPeaks.bed
Sc0000000       7408    7597
Sc0000000       14780   15045
Sc0000000       22881   23243
Sc0000000       23951   24242
Sc0000000       69534   69706
Sc0000000       80896   81089
Sc0000000       89637   89819
Sc0000000       100953  101378
Sc0000000       145876  146110
Sc0000000       167032  167292
(ananse_pip_venv) [aperpos@nodos genome]$ cat idr_ConsPeaks_amphi_DMSO.bed | wc -l
33921

If we understand correctly, it is reporting that some number of regions (perhaps those in the bam? although the bam is a continuum of read mappings..) does not match the number of enhancer regions provided in the bed, albeit the difference is very very low (14 in this case, up to 20-30 in other iterations using different region files). However we are not sure where is ananse finding that information, nor how can we fix it or provide it correctly. The bams were generated using the same version of the genome as the one in the enhancer regions file (i.e. same chromosome names). In fact, the regions file we provide is actually an IDR consensus of all the atac bams, meaning it comes from the same place as the bams. This has not happened with other non-model organisms I have ran ananse with. We are a bit at a loss here and wonder if perhaps you could shed some light on what might be happening.

Please let us know if there is something else you need to know or we can help with.

Thanks in advance and all the best,

Alberto

JGASmits commented 2 years ago

Hi Alberto, while waiting for another person to take a look at this problem, perhaps you can check if there are non-unique regions in your enhancer file that might cause to this error? I can imagine that might cause such a small number of regions to be removed.

Gr Jos

Maarten-vd-Sande commented 2 years ago

You could try sth in line of

uniq -u idr_ConsPeaks_amphi_DMSO.bed | wc -l
apposada commented 2 years ago

Hi Gr, Maarten,

We checked using sort and mergeBed. Unfortunately there does not seem to be non-unique regions in the file:

(base) [agilgal@nodos genome]$ wc -l idr_ConsPeaks_amphi_DMSO.bed
33921 idr_ConsPeaks_amphi_DMSO.bed
(base) [agilgal@nodos genome]$ sort -k1,1 -k2,2n idr_ConsPeaks_amphi_DMSO.bed | mergeBed -i - | wc -l
33921

Thanks in advance,

Alberto

Maarten-vd-Sande commented 2 years ago

@apposada could you email your bedfile and the sizes of all the chromosomes?

You can get the sizes of the chromosomes like this:

pip install pyfaidx
faidx input.fasta -i chromsizes > sizes.genome
JGASmits commented 2 years ago

Hi Alberto,

Another thing you could try is to make sure all your enhancer regions are of the same length (centered on the ATAC summit with aprox 200bp long regions). Right now the region length seems to vary between enhancers, which is non-ideal for motif scanning & normalizations downstream. If you have multiple narrowPeak files from ATAC seq experiments you need to combine can use the combine peak function from gimme motifs before using them as the bedfile input: https://github.com/vanheeringen-lab/gimmemotifs/blob/984399eaef3f05fd04c0b45c62efe9d287aaccf8/gimmemotifs/preprocessing.py#L193

Gr Jos

xjtutgb commented 2 years ago

the error is happening in these few lines:

tmp = pd.DataFrame(index=self.regions)
for bam in bams:
    result = load_heatmap_data(...)
    tmp[result[0]] = result[2].T[0]

I think this means your bam files (-H 2.rep1.bam 2.rep2.bam) and your region files (-r 1.rep1.narrowPeak 1.rep2.narrowPeak) have some kind of unexpected mismatch... Are you using ANANSE 0.3.0? this version should remove the header from regionfiles. If you are using an older version, that could explain things.

Thanks for your reply,i sure i am using 0.3.0,and the narrowpeak files is the raw results of MACS2 with no header;I also have tried to change the code part of load_H3K27_bam? ,in which the "width=2000" to "wildth=200" as the ATAC bam load part,it pass the test.Is it means the "width" parameter cause the error?

siebrenf commented 2 years ago

I think I found out what's happening. I've found 2 situations where we I get a similar error:

I'll let you know when I do have a fix for the 2nd issue. If you cannot wait, try to find which region(s) are not overlapping with your bam reads, and remove those.

@xjtutgb increasing the width parameter increases the change of finding overlap, so that could fix it. However, this is not what we want.

siebrenf commented 2 years ago

I think it's now fixed in the network branch. Tests are not ready yet, so if there are new issues, please let me know!

pip install git+https://github.com/vanheeringen-lab/ANANSE.git@network

apposada commented 2 years ago

Thanks, we will try this right away and will let you know what happens. Alberto

apposada commented 2 years ago

Hi, it seems to be working now after updating using pip install git+https://github.com/vanheeringen-lab/ANANSE.git@network . Thanks a lot for the quick fix.

Maarten-vd-Sande commented 2 years ago

We managed to run the binding step but the output does not look right. When we do ananse view of the binding.h5, we retrieve an empty matrix (only genomic regions as rows, but no columns of factors.

As always this is a non-model organism (xenopus) and we had to run motif2factors. The .pfm looks fine, as well as the .txt that accompanies it (see a snippet below). Are you aware of any issue related to this?

(ananse_pip_venv) [aperpos@nodos 20211110_motif2factors]$ grep ">" ~/.local/share/genomes/XENTR/XENTR.pep.fa | head
NM_001001190.1.p1
NM_001001191.1.p1
NM_001001196.1.p1
NM_001001197.1.p1
NM_001001198.1.p1
NM_001001199.1.p1
NM_001001200.1.p1
NM_001001202.1.p1
NM_001001204.1.p1
NM_001001205.1.p1
(ananse_pip_venv) [aperpos@nodos 20211110_motif2factors]$ head XENTR.gimme.vertebrate.v5.0.motif2factors.txt
Motif   Factor  Evidence        Curated
GM.5.0.Sox.0001 XM_018091270.1.p1       Orthologs       N
GM.5.0.Sox.0001 XM_012963160.2.p1       Orthologs       N
GM.5.0.Sox.0001 XM_018091271.1.p1       Orthologs       N
GM.5.0.Sox.0001 XM_012963174.2.p1       Orthologs       N
GM.5.0.Sox.0001 NM_001017004.2.p1       Orthologs       N
GM.5.0.Sox.0001 XM_012963150.2.p1       Orthologs       N
GM.5.0.Sox.0001 XM_012963155.2.p1       Orthologs       N
GM.5.0.Sox.0001 XM_018091269.1.p1       Orthologs       N
GM.5.0.Sox.0001 NM_001129937.1.p1       Orthologs       N

Thanks a lot, Alberto

siebrenf commented 2 years ago

I've made quite a few changes to this branch, and it should be stable now. (you will need to run the pip install cmd again)

Please note you'll need to update genomepy to version 0.11.0 (this is the only dependency change). Also note command line arguments have changed or added. Check ananse binding --help (and for other commands too) for details!

siebrenf commented 2 years ago

@apposada, try to rerun ANANSE binding after updating!

To speed up this test, you can specify a single bam file, and a region file with only few regions. Technically, 1 region is enough, as long as there is at least 1 bam read near. If you specify ~100 regions on chromosome 1, that should be enough to see if it works!

If you still have issues, your gene annotation might be the issue. Ensembl's Xenopus_tropicalis_v9.1 works pretty well for me :) You can get it with genomepy install Xenopus_tropicalis_v9.1 -p ensembl -a