kaizhang / SnapATAC2

Single-cell epigenomics analysis tools
https://kzhang.org/SnapATAC2/
195 stars 20 forks source link

Error in tl.merge_peaks with imported peak data. #243

Closed zkcao closed 4 months ago

zkcao commented 4 months ago

Dear Kai,

I tried to import some peak data with the following script, however, as I tried to call snap.tl.merge_peaks, it threw some data type conversion error. Can you offer some help on this? Thanks a lot for your help!

peaks_dict = {}

for celltype in groups:
    # Read the narrowPeak
    df = pd.read_csv(f"{}/{celltype}_macs3_peaks_peaks.narrowPeak".format(datadir), 
                     sep="\t", 
                     header=None)
    df[3] = "."
    df.columns = ['chrom', 'start', 'end', 'name', 'score', 'strand', 'signal_value',
                  'p_value', 'q_value', 'peak']

    df_pl = pl.from_pandas(df)

    # Ensure specific data types in Polars
    df_pl = df_pl.with_columns([
        df_pl['chrom'].cast(pl.Utf8),
        df_pl['start'].cast(pl.UInt64),
        df_pl['end'].cast(pl.UInt64),
        df_pl['name'].cast(pl.Utf8),
        df_pl['score'].cast(pl.UInt16),
        df_pl['strand'].cast(pl.Utf8),
        df_pl['signal_value'].cast(pl.Float64),
        df_pl['p_value'].cast(pl.Float64),
        df_pl['q_value'].cast(pl.Float64),
        df_pl['peak'].cast(pl.UInt64),
    ])

    # Add the polars DataFrame to the dictionary with the celltype as the key
    peaks_dict[celltype] = df_pl

# Store the dictionary in adata.uns under the key 'macs3'
adata.uns['macs3'] = peaks_dict

#Merge Peaks...
peaks = snap.tl.merge_peaks(adata.uns['macs3'], snap.genome.hg38)
thread '<unnamed>' panicked at src/call_peaks.rs:144:64:
called `Result::unwrap()` on an `Err` value: TryFromIntError(1006)
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In[49], line 1
----> 1 peaks = snap.tl.merge_peaks(adata.uns['macs3'], snap.genome.hg38)

File ~/.conda/envs/scpy/lib/python3.9/site-packages/snapatac2/tools/_call_peaks.py:199, in merge_peaks(peaks, chrom_sizes, half_width)
    174 """Merge peaks from different groups.
    175 
    176 Merge peaks from different groups. This function is typically used to merge
   (...)
    196 macs3
    197 """
    198 chrom_sizes = chrom_sizes.chrom_sizes if isinstance(chrom_sizes, Genome) else chrom_sizes
--> 199 return _snapatac2.py_merge_peaks(peaks, chrom_sizes, half_width)

PanicException: called `Result::unwrap()` on an `Err` value: TryFromIntError(1006)
kaizhang commented 4 months ago

The 5th column of the narrow peak file produced by macs3 command line tool may contain invalid values, you need to fix that manually before using that for peak merging. Please read the macs3 documentation for more details:

NAME_peaks.narrowPeak is BED6+4 format file which contains the peak locations together with peak summit, p-value, and q-value. You can load it to the UCSC genome browser. Definition of some specific columns are:

5th: integer score for display. It's calculated as int(-10log10pvalue) or int(-10log10qvalue) depending on whether -p (pvalue) or -q (qvalue) is used as score cutoff. Please note that currently this value might be out of the [0-1000] range defined in UCSC ENCODE narrowPeak format. You can let the value saturated at 1000 (i.e. p/q-value = 10^-100) by using the following 1-liner awk: awk -v OFS="\t" '{$5=$5>1000?1000:$5} {print}' NAME_peaks.narrowPeak 7th: fold-change at peak summit 8th: -log10pvalue at peak summit 9th: -log10qvalue at peak summit 10th: relative summit position to peak start

zkcao commented 4 months ago

Thanks for the suggestions! After I fixed the 5th column with the awk -v OFS="\t" '{$5=$5>1000?1000:$5} {print}' NAME_peaks.narrowPeakcode, a new error occurred:

peaks = snap.tl.merge_peaks(adata.uns['macs3'], snap.genome.hg38)
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In[79], line 1
----> 1 peaks = snap.tl.merge_peaks(adata.uns['macs3'], snap.genome.hg38)

File ~/.conda/envs/scpy/lib/python3.9/site-packages/snapatac2/tools/_call_peaks.py:199, in merge_peaks(peaks, chrom_sizes, half_width)
    174 """Merge peaks from different groups.
    175 
    176 Merge peaks from different groups. This function is typically used to merge
   (...)
    196 macs3
    197 """
    198 chrom_sizes = chrom_sizes.chrom_sizes if isinstance(chrom_sizes, Genome) else chrom_sizes
--> 199 return _snapatac2.py_merge_peaks(peaks, chrom_sizes, half_width)

PanicException: called `Option::unwrap()` on a `None` value

However, I checked for None value with the following code but could not find any None value:

for celltype, df in peaks_dict.items():
    print(f"Checking for None values in {celltype}:")

    none_found = False

    for col in df.columns:
        if df[col].is_null().sum() > 0:
            print(f"  - None values found in column '{col}'.")
            none_found = True

    if not none_found:
        print("  No None values found.")
    print("")  

Again, thanks a lot for the help!

kaizhang commented 4 months ago

Can you share with me the peak files and your code?

zkcao commented 4 months ago

sent! thanks a lot for your help!

kaizhang commented 4 months ago

sent! thanks a lot for your help!

Where did you send? I didn't see it.

zkcao commented 4 months ago

zhangkai33@westlake.edu.cn this one

kaizhang commented 4 months ago

The email address is correct. But I didn't receive it. Please try this one: zk65900931@gmail.com

zkcao commented 4 months ago

sent again!

kaizhang commented 4 months ago

Got it. Thank you!

kaizhang commented 4 months ago

The issue is that some chromosomes in your peak file are missing from the built-in genome. Note the default genome contains only regular chromosomes. Here is the code that works for you data:

import polars as pl
import snapatac2 as snap
import sys

def read_peaks(file):
    schema = {
        'chrom': pl.Utf8,
        'start': pl.UInt64,
        'end': pl.UInt64,
        'name': pl.Utf8,
        'score': pl.UInt16,
        'strand': pl.Utf8,
        'signal_value': pl.Float64,
        'p_value': pl.Float64,
        'q_value': pl.Float64,
        'peak': pl.UInt64,
    }
    df = pl.read_csv(file, has_header=False, separator='\t', schema=schema)
    return df

if __name__ == '__main__':
    files = sys.argv[1:]
    peaks = {}
    for file in files:
        df = read_peaks(file)
        df = df.filter(pl.col('chrom').is_in(snap.genome.hg38.chrom_sizes.keys()))
        peaks[file] = df
    print(snap.tl.merge_peaks(peaks, snap.genome.hg38))
zkcao commented 4 months ago

it worked perfectly now! Really appreciate your help!