Open LinearParadox opened 6 months ago
So i fixed this by removing the peak names. However this leads to another issue:
Traceback (most recent call last): File "/miniconda3/envs/MACS3/lib/python3.12/multiprocessing/pool.py", line 125, in worker result = (True, func(*args, *kwds)) ^^^^^^^^^^^^^^^^^^^ File "/miniconda3/envs/MACS3/lib/python3.12/multiprocessing/pool.py", line 48, in mapstar return list(map(args)) ^^^^^^^^^^^^^^^^ File "/miniconda3/envs/MACS3/lib/python3.12/site-packages/MACS3/Commands/callvar_cmd.py", line 342, in call_variants_at_range PRI = collection.get_PosReadsInfo_ref_pos ( i, ref_nt, Q=minQ ) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "MACS3/Signal/RACollection.pyx", line 358, in MACS3.Signal.RACollection.RACollection.get_PosReadsInfo_ref_pos File "MACS3/Signal/RACollection.pyx", line 382, in MACS3.Signal.RACollection.RACollection.get_PosReadsInfo_ref_pos File "MACS3/Signal/ReadAlignment.pyx", line 376, in MACS3.Signal.ReadAlignment.ReadAlignment.get_variant_bq_by_ref_pos File "MACS3/Signal/ReadAlignment.pyx", line 408, in MACS3.Signal.ReadAlignment.ReadAlignment.get_variant_bq_by_ref_pos File "MACS3/Signal/ReadAlignment.pyx", line 307, in MACS3.Signal.ReadAlignment.ReadAlignment.get_REFSEQ ValueError: invalid literal for int() with base 10: b'' """
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/miniconda3/envs/MACS3/bin/macs3", line 1028, in
@LinearParadox It seems that MACS3 can't read the broadPeak file correctly. Could you share us the top few lines of your broadpeak file?
My current file is:
chr1 903718 907055 + 3338 chr1 936442 936670 + 229 chr1 940313 943409 + 3097 chr1 980575 982962 + 2388
I realize for the original error I accidentally put peak names as the first column. For context, I called a set of differential peaks in R. I'm interested in seeing if there are any variants within these peaks. So I filtered my bam file to include these peaks only, and am running it on the filtered bam and this bed file.
I tested running this on the original peak calls and still get the same error. The original peak calls:
chr1 17311 17589 Met.mRp.clN_peak_1 388 . 3.13389 40.8058 38.8152 chr1 136647 136963 Met.mRp.clN_peak_2 87 . 2.01892 10.5187 8.76954 chr1 139009 139315 Met.mRp.clN_peak_3 22 . 1.51709 3.67699 2.29135 chr1 180870 181834 Met.mRp.clN_peak_4 389 . 3.13599 40.9225 38.9172 chr1 182700 184519 Met.mRp.clN_peak_5 103 . 2.0168 11.9825 10.3187 chr1 185721 187131 Met.mRp.clN_peak_6 57 . 1.78087 7.34263 5.77247
Your original peak file looks fine to me. It's in the appropriate BED+ format. Only the first six columns are essential and they should be in order: 1. Chromosome; 2. Start; 3. End; 4. Name; 5. Score; 6. Strand. Perhaps you can try to cut the first six columns? Also, please make sure the delimiter in the file is "tab" (or "\t"). Then try again?
Hmm... I am wrong with my previous comments. The error was found in: https://github.com/macs3-project/MACS/blob/1d6c38ef4b800a50632d5ddbac79c175bfefe7d4/MACS3/Signal/ReadAlignment.pyx#L307
Here MACS tries to use the MD tag and CIGAR in BAM format to recover the reference sequence. MACS are supposed to see an integer but it turns out to be an empty string. I saw you mentioned that you made a BAM file for only the reads in the peaks. In fact, you don't have to do so. Please check the document here: https://macs3-project.github.io/MACS/docs/callvar.html You can use the unfiltered BAM file to call variants given the BAM file has been sorted and indexed (with a .bai file).
SO I tried it with this, and I'm getting a different error. It's not fatal, but I'm not sure if it's impacting the results:
Arguments: ('chr1', 226478788, 226479694, '. Skipped!')
---begin of peak---
INFO @ 06 May 2024 20:48:31: [130 MB] Peak: chr1 226537600 226537891
--- Logging error ---
Traceback (most recent call last):
File "/miniconda3/envs/MACS3/lib/python3.12/site-packages/MACS3/Commands/callvar_cmd.py", line 195, in run
ra_collection = RACollection( chrom, peak, tbam.get_reads_in_region( chrom, peak["start"], peak["end"], maxDuplicate=maxDuplicate ) )
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "MACS3/IO/BAM.pyx", line 537, in MACS3.IO.BAM.BAMaccessor.get_reads_in_region
File "MACS3/IO/BAM.pyx", line 574, in MACS3.IO.BAM.BAMaccessor.get_reads_in_region
OverflowError: value too large to convert to int
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "miniconda3/envs/MACS3/lib/python3.12/logging/__init__.py", line 1160, in emit
msg = self.format(record)
^^^^^^^^^^^^^^^^^^^
File "miniconda3/envs/MACS3/lib/python3.12/logging/__init__.py", line 999, in format
return fmt.format(record)
^^^^^^^^^^^^^^^^^^
File "miniconda3/envs/MACS3/lib/python3.12/logging/__init__.py", line 703, in format
record.message = record.getMessage()
^^^^^^^^^^^^^^^^^^^
File "miniconda3/envs/MACS3/lib/python3.12/logging/__init__.py", line 392, in getMessage
msg = msg % self.args
~~~~^~~~~~~~~~~
TypeError: not all arguments converted during string formatting
Call stack:
File "miniconda3/envs/MACS3/bin/macs3", line 1028, in <module>
main()
File "miniconda3/envs/MACS3/bin/macs3", line 105, in main
run( args )
File "miniconda3/envs/MACS3/lib/python3.12/site-packages/MACS3/Commands/callvar_cmd.py", line 197, in run
info ("No reads found in peak: ", chrom.decode(), peak["start"], peak["end"], ". Skipped!")
File "miniconda3/envs/MACS3/lib/python3.12/logging/__init__.py", line 1539, in info
self._log(INFO, msg, args, **kwargs)
File "miniconda3/envs/MACS3/lib/python3.12/site-packages/MACS3/Utilities/Logger.py", line 14, in _log
super()._log(level, f"[{mem_usage} MB] {msg}", args, exc_info, extra, stack_info)
Message: '[130 MB] No reads found in peak: '
Arguments: ('chr1', 226537600, 226537891, '. Skipped!')
Edit: This also produces an empty variants file, which is somehwat unexpected.
This error is caused due to the fact that, the size of a chunk of alignment result in BAM is decoded as larger than 2^32. It shouldn't happen normally since we only tried to decode 4bytes data into an unsigned int in the line of code pointed out by the error message:
There must be something I didn't notice before.
Could we check if this time, the BAM file you used is legit? How did you generate it? What's the output from samtools flagstat
? Is the ".BAI" file updated?
Also, thanks for pointing out the error related to:
Ideally, if there is any error during retrieving reads from a given peak region, MACS3 should skip the peak. And it seems this line make the whole program quit. A possible solution is to remove the peak from your peak file and try again. And I will fix it in the next release. -- fixed in 759e100
Let me double check if it's updated. Bam was generated through the nfcore atac seq pipeline. Weird thing is I remember running these samples through the same pipeline a while ago, and callvar running fine then. I reran them again, only difference being I used chromap as an aligner instead of bwa (default).
samtools flagstat gives the following:
1547945886 + 0 in total (QC-passed reads + QC-failed reads)
1547945886 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
1547945886 + 0 mapped (100.00% : N/A)
1547945886 + 0 primary mapped (100.00% : N/A)
1547945886 + 0 paired in sequencing
773972943 + 0 read1
773972943 + 0 read2
1547945886 + 0 properly paired (100.00% : N/A)
1547945886 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
I see. This is a very important information. I just tested a bam file from chromap
and can reproduce your error. Let me check where the problem is.
OK. Now I got it. Here is an alignment in SAM format from chromap
that can cause the problem in callvar
:
SRR1822137.116112 83 chrI 149 60 50M = 39 -160 GCCCTGTCCCATTCAACCATACCACTCCGAACCACCATCCATCCCTCTAC 1HFAFD>@GFIJJJJJJGHF?GC?F?GHGG@IIGJIHFCDF?AFFFF@@@ NM:i:1 MD:Z:A49
And this is the alignment from bwa
:
SRR1822137.116112 83 chrI 149 54 50M = 39 -160 GCCCTGTCCCATTCAACCATACCACTCCGAACCACCATCCATCCCTCTAC 1HFAFD>@GFIJJJJJJGHF?GC?F?GHGG@IIGJIHFCDF?AFFFF@@@ NM:i:1 MD:Z:0A49 MC:Z:50M AS:i:49 XS:i:44 XA:Z:chrII,-5937,50M,2;
The difference is on the 'MD' tag, with which callvar
can recover the original DNA sequence from the reference genome. According to the definition in SAM optional tags, the MD tag should have a format of MD:Z:[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*
. And there should be a number, even if it is 0
, at the beginning of the MD tag. In bwa, it's correct -- 0A49
, but in chromap
, it's incorrect -- A49
. When callvar
processes an MD tag as A49
, it will throw the error you found because it tries to interpret the first character in the MD tag as a number. There are more similar problems in the MD tag field in the SAM/BAM file generated from chromap
.
The solution: if you have the original fasta file of the genome, you can fix these 'MD' tags by using samtools calmd
function, and run callvar
on the new BAM file. Example:
samtools calmd -b chromap.bam genome.fa > md_fixed.bam
samtools sort -o md_fixed.sorted.bam md_fixed.bam
samtools index md_fixed.bam
So, @LinearParadox, do you mind submitting a ticket to chromap
regarding this?
Hi @LinearParadox, did the samtools calmd
command solve this issue?
Seems to be running fine now! Is there any way to speed it up by chance? It's been a day and I'm still waiting on it to finish.
There are a couple of things to tweak to make the process faster. 1. decrease the number of peaks you plan to call variants; 2. use multiple CPUs with option -m NP
; 3. turn off fermi for assembling the reads in the peak region with option -F off
, and this will sacrifice the specificity for sensitivity and speed.
Got it. The odd thing is when I pass the m option, my CPU usage never seems to go above 100% (when monitored with top). There seems to be no difference vs if I just didn't pass the option.
When running on a broadpeak file, callvar does the following:
File "MACS3/Signal/RACollection.pyx", line 358, in MACS3.Signal.RACollection.RACollection.get_PosReadsInfo_ref_pos File "MACS3/Signal/RACollection.pyx", line 382, in MACS3.Signal.RACollection.RACollection.get_PosReadsInfo_ref_pos File "MACS3/Signal/ReadAlignment.pyx", line 376, in MACS3.Signal.ReadAlignment.ReadAlignment.get_variant_bq_by_ref_pos File "MACS3/Signal/ReadAlignment.pyx", line 408, in MACS3.Signal.ReadAlignment.ReadAlignment.get_variant_bq_by_ref_pos File "MACS3/Signal/ReadAlignment.pyx", line 307, in MACS3.Signal.ReadAlignment.ReadAlignment.get_REFSEQ ValueError: invalid literal for int() with base 10: b'' """
The above exception was the direct cause of the following exception:
Traceback (most recent call last): File "/miniconda3/envs/MACS3/bin/macs3", line 1028, in
main()
File "/envs/MACS3/bin/macs3", line 105, in main
run( args )
File "/miniconda3/envs/MACS3/lib/python3.12/site-packages/MACS3/Commands/callvar_cmd.py", line 235, in run
results = mapresults.get(timeout=window_size*300)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/MACS3/lib/python3.12/multiprocessing/pool.py", line 774, in get
raise self._value
ValueError: invalid literal for int() with base 10: b''