kaizhang / SnapATAC2

Single-cell epigenomics analysis tools
https://kzhang.org/SnapATAC2/
223 stars 24 forks source link

import_data(mybedfile.bed, ...) produces an adata with shape()==N x 0 #356

Open mkarikom opened 2 days ago

mkarikom commented 2 days ago

There is nothing imported from snap.datasets.pbmc500(downsample=True).

>>> import snapatac2 as snap
>>> fragment_file = snap.datasets.pbmc500(downsample=True)
>>> data2 = snap.pp.import_data(
...         fragment_file,
...         chrom_sizes=snap.genome.hg38,
...         sorted_by_barcode=False,
...         min_num_fragments=0,
...     )
>>> data2
AnnData object with n_obs × n_vars = 88363 × 0
    obs: 'n_fragment', 'frac_dup', 'frac_mito'
    uns: 'reference_sequences'
    obsm: 'fragment_paired'

I also ran one of the corresponding unit tests, which failed:

(/nfs/turbo/umms-welchjd/mkarikom/shared_snapatac312) [mkarikom@gl1510 snapatac2-python]$ pytest tests/test_tools.py::test_import
========================================================================================================================================== test session starts ==========================================================================================================================================
platform linux -- Python 3.12.7, pytest-8.3.3, pluggy-1.5.0
rootdir: /nfs/turbo/umms-welchjd/mkarikom/snapatac_kaizhang/SnapATAC2/snapatac2-python
configfile: pyproject.toml
plugins: hypothesis-6.115.0, typeguard-4.3.0
collected 1 item                                                                                                                                                                                                                                                                                        

tests/test_tools.py 
F                                                                                                                                                                                                                                                                             [100%]

=============================================================================================================================================== FAILURES ================================================================================================================================================
______________________________________________________________________________________________________________________________________________ test_import ______________________________________________________________________________________________________________________________________________

datadir = local('/tmp/pytest-of-mkarikom/pytest-1/test_import0')

    def test_import(datadir):
        test_files = [snap.datasets.pbmc500(downsample=True), str(datadir.join('test_clean.tsv.gz'))]

        for fl in test_files:
>           data = snap.pp.import_data(
                fl,
                chrom_sizes=snap.genome.hg38,
                min_num_fragments=0,
                sorted_by_barcode=False,
            )

tests/test_tools.py:132: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

fragment_file = '/tmp/pytest-of-mkarikom/pytest-1/test_import0/test_clean.tsv.gz', chrom_sizes = {'chr1': 248956422, 'chr10': 133797422, 'chr11': 135086622, 'chr12': 133275309, ...}

    def import_data(
        fragment_file: Path | list[Path],
        chrom_sizes: Genome | dict[str, int],
        *,
        file: Path | list[Path] | None = None,
        min_num_fragments: int = 200,
        sorted_by_barcode: bool = True,
        whitelist: Path | list[str] | None = None,
        chrM: list[str] = ["chrM", "M"],
        shift_left: int = 0,
        shift_right: int = 0,
        chunk_size: int = 2000,
        tempdir: Path | None = None,
        backend: Literal['hdf5'] = 'hdf5',
        n_jobs: int = 8,
    ) -> internal.AnnData:
        """Import data fragment files and compute basic QC metrics.

        A fragment refers to the sequence data originating from a distinct location
        in the genome. In single-ended sequencing, one read equates to a fragment.
        However, in paired-ended sequencing, a fragment is defined by a pair of reads.
        This function is designed to handle, store, and process input files with
        fragment data, further yielding a range of basic Quality Control (QC) metrics.
        These metrics include the total number of unique fragments, duplication rates,
        and the percentage of mitochondrial DNA detected.

        How fragments are stored is dependent on the sequencing approach utilized.
        For single-ended sequencing, fragments are found in `.obsm['fragment_single']`.
        In contrast, for paired-ended sequencing, they are located in
        `.obsm['fragment_paired']`.

        Diving deeper technically, the fragments are internally structured within a
        Compressed Sparse Row (CSR) matrix. In this configuration, each row signifies
        a specific cell, while each column represents a unique genomic position.
        Fragment starting positions dictate the column indices. Matrix values
        capture the lengths of the fragments for paired-end reads or the lengths of
        the reads for single-ended scenarios. It's important to note that for
        single-ended reads, the values are signed, with the sign providing information
        on the fragment's strand orientation. Additionally, it is worth noting that
        cells may harbor duplicate fragments, leading to the presence of duplicate
        column indices within the matrix. As a result, the matrix deviates from
        the standard CSR format, and it is not advisable to use the matrix for linear
        algebra operations.

        .. image:: /_static/images/func+import_data.svg
            :align: center

        Note
        ----
        - This function accepts both single-end and paired-end reads.
          If the records in the fragment file contain 6 columns with the last column
          representing the strand of the fragment, the fragments are considered single-ended.
          Otherwise, the fragments are considered paired-ended.
        - When `file` is not `None`, this function uses constant memory regardless of
          the size of the input file.
        - When `sorted_by_barcode` is `False`, this function will sort the fragment file
          first, during which temporary files will be created in `tempdir`. The size of
          temporary files is proportional to the number of records in the fragment file.
          For large fragment files, it is recommended to set `tempdir` to a location with
          sufficient space in order to avoid running out of disk space.
        - The QC metrics are computed only for reads that are included by the `whitelist`
          or `chrom_sizes`.

        Warning
        -------
        When the input to the function is a list of files, it employs multiprocessing
        to process these files concurrently. In this case, however, it is crucial to
        safeguard the entry point of the program by encapsulating the function call
        within `if __name__ == '__main__':`. This condition ensures that the module
        is being run as the main program and not being loaded as a module from
        another script. Without this protection, each subprocess might attempt to
        spawn its own subprocesses, leading to a cascade of process spawns—a situation
        that can cause the program to hang or crash due to infinite recursion.
        You don't need to do this in Jupyter notebook as it automatically does that.

        Parameters
        ----------
        fragment_file
            File name of the fragment file, optionally compressed with gzip or zstd.
            This can be a single file or a list of files.
            If it is a list of files, a separate AnnData object will be created for each file.
            A fragment file must contain at least 5 columns:
            chromosome, start, end, barcode, count.
            Optionally it can contain one more column indicating the strand of the fragment.
            When strand is provided, the fragments are considered single-ended.
        chrom_sizes
            A Genome object or a dictionary containing chromosome sizes, for example,
            `{"chr1": 2393, "chr2": 2344, ...}`.
        file
            File name of the output h5ad file used to store the result. If provided,
            result will be saved to a backed AnnData, otherwise an in-memory AnnData
            is used.
            If `fragment_file` is a list of files, `file` must also be a list of files if provided.
        min_num_fragments
            Number of unique fragments threshold used to filter cells
        sorted_by_barcode
            Whether the fragment file has been sorted by cell barcodes.
            This function will be faster if `sorted_by_barcode==True`.
            Note the :func:`~snapatac2.pp.make_fragment_file` will always sort the
            fragment file by barcode.
        whitelist
            File name or a list of barcodes. If it is a file name, each line
            must contain a valid barcode. When provided, only barcodes in the whitelist
            will be retained.
        shift_left
            Insertion site correction for the left end. This is set to 0 by default,
            as shift correction is usually done in the fragment file generation step.
        chrM
            A list of chromosome names that are considered mitochondrial DNA. This is
            used to compute the fraction of mitochondrial DNA.
        shift_right
            Insertion site correction for the right end. Note this has no effect on single-end reads.
            For single-end reads, `shift_right` will be set using the value of `shift_left`.
            This is set to 0 by default, as shift correction is usually done in the fragment
            file generation step.
        chunk_size
            Increasing the chunk_size may speed up I/O but will use more memory.
            The speed gain is usually not significant.
        tempdir
            Location to store temporary files. If `None`, system temporary directory
            will be used.
        backend
            The backend.
        n_jobs
            Number of jobs to run in parallel when `fragment_file` is a list.
            If `n_jobs=-1`, all CPUs will be used.

        Returns
        -------
        AnnData | ad.AnnData
            An annotated data matrix of shape `n_obs` x `n_vars`. Rows correspond to
            cells and columns to regions. If `file=None`, an in-memory AnnData will be
            returned, otherwise a backed AnnData is returned.

        See Also
        --------
        make_fragment_file
        :func:`~snapatac2.ex.export_fragments`

        Examples
        --------
        >>> import snapatac2 as snap
        >>> data = snap.pp.import_data(snap.datasets.pbmc500(downsample=True), chrom_sizes=snap.genome.hg38, sorted_by_barcode=False)
        >>> print(data)
        AnnData object with n_obs × n_vars = 585 × 0
            obs: 'n_fragment', 'frac_dup', 'frac_mito'
            uns: 'reference_sequences'
            obsm: 'fragment_paired'
        """
        chrom_sizes = chrom_sizes.chrom_sizes if isinstance(chrom_sizes, Genome) else chrom_sizes
        if len(chrom_sizes) == 0:
            raise ValueError("chrom_size cannot be empty")

        if whitelist is not None:
            if isinstance(whitelist, str) or isinstance(whitelist, Path):
                with open(whitelist, "r") as fl:
                    whitelist = set([line.strip() for line in fl])
            else:
                whitelist = set(whitelist)

        if isinstance(fragment_file, list):
            n = len(fragment_file)
            if file is None:
                adatas = [AnnData() for _ in range(n)]
            else:
                if len(file) != n:
                    raise ValueError("The length of 'file' must be the same as the length of 'fragment_file'")
                adatas = [internal.AnnData(filename=f, backend=backend) for f in file]

            snapatac2._utils.anndata_ipar(
                list(enumerate(adatas)),
                lambda x: internal.import_fragments(
                    x[1], fragment_file[x[0]], chrom_sizes, chrM, min_num_fragments,
                    sorted_by_barcode, shift_left, shift_right, chunk_size, whitelist, tempdir,
                ),
                n_jobs=n_jobs,
            )
            return adatas
        else:
            adata = AnnData() if file is None else internal.AnnData(filename=file, backend=backend)
>           internal.import_fragments(
                adata, fragment_file, chrom_sizes, chrM, min_num_fragments,
                sorted_by_barcode, shift_left, shift_right, chunk_size, whitelist, tempdir,
            )
E           pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: InputError(Custom { kind: Other, error: "MissingStartPosition: version https://git-lfs.github.com/spec/v1" })

python/snapatac2/preprocessing/_basic.py:335: PanicException
----------------------------------------------------------------------------------------------------------------------------------------- Captured stderr call ------------------------------------------------------------------------------------------------------------------------------------------
thread '<unnamed>' panicked at src/preprocessing.rs:109:14:
called `Result::unwrap()` on an `Err` value: InputError(Custom { kind: Other, error: "MissingStartPosition: version https://git-lfs.github.com/spec/v1" })
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
=========================================================================================================================================== warnings summary ============================================================================================================================================
tests/test_tools.py::test_import
tests/test_tools.py::test_import
tests/test_tools.py::test_import
tests/test_tools.py::test_import
tests/test_tools.py::test_import
  /nfs/turbo/umms-welchjd/mkarikom/snapatac_kaizhang/SnapATAC2/snapatac2-python/python/snapatac2/preprocessing/_basic.py:335: DeprecationWarning: `Series._import_from_c` is deprecated. use _import_arrow_from_c; if you are using an extension, please compile it with latest 'pyo3-polars'
    internal.import_fragments(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================================================================================================================================== short test summary info ========================================================================================================================================
FAILED tests/test_tools.py::test_import - pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: InputError(Custom { kind: Other, error: "MissingStartPosition: version https://git-lfs.github.com/spec/v1" })
==================================================================================================================================== 1 failed, 5 warnings in 38.14s =====================================================================================================================================
kaizhang commented 2 days ago

pp.import_data doesn't automatically create count matrix. So the shape is correct. You need to call pp.add_tile_matrix after pp.import_data.

mkarikom commented 1 day ago

Thank you for the helpful advice (and such and awesome package)! The first example above now works! The unit test still fails, but maybe it is ok to close this issue?