pyOpenSci / software-submission

Submit your package for review by pyOpenSci here! If you have questions please post them here: https://pyopensci.discourse.group/

OpenOmics: Library for integration of multi-omics, annotation, and interaction data #31

Closed JonnyTran closed 3 years ago

JonnyTran commented 3 years ago

Submitting Author: Jonny Tran (@JonnyTran)
All current maintainers: @JonnyTran
Package Name: openomics
One-Line Description of Package: Library for integration of multi-omics, annotation, and interaction data
Repository Link: https://github.com/JonnyTran/OpenOmics
Version submitted: 0.8.4
Editor: @NickleDave
Reviewer 1: @gawbul
Reviewer 2: @ksielemann
Archive: DOI
JOSS DOI: DOI
Version accepted: v0.8.8
Date accepted (month/day/year): 04/17/2021


Description

OpenOmics is a Python library that assists with the integration of heterogeneous multi-omics bioinformatics data. By providing an API of data manipulation tools as well as a web interface (WIP), OpenOmics streamlines the common coding tasks involved in preparing data for bioinformatics analysis. It features support for:

OpenOmics also has an efficient data pipeline that bridges the popular data manipulation Pandas library and Dask distributed processing to address the following use cases:

Scope

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see notes on categories of our guidebook.

OpenOmics' core functionalities are to provide a suite of tools for data preprocessing, data integration, and public database retrieval. Its main goal is to maximize the transparency and reproducibility in the process of multi-omics data integration.

OpenOmics' primary target audience is computational bioinformaticians, and the scientific application of this package is to provide scalable ad-hoc data-frame manipulation for multi-omics data integration in a reproducible manner. We are also currently developing an interactive web dashboard and interfaces to the Galaxy Tool Shed to disseminate the tool to biologists without a programming background.

Existing PyPI Python packages within the scope of multi-omics data analysis include "pythomics" and "omics". Their functions appear to lack support for manipulation of integrated multi-omics datasets, retrieval of public databases, and an extensible OOP design. OpenOmics aims to follow modern software best practices and package publishing standards.

Aside from multi-omics integration tools, several specialized Python packages exist for single-omics data, such as ScanPy's "AnnData" and "Loom" files. They provide an intuitive data structure for expression arrays and side annotations, and the Loom file format even allows for out-of-core data-frame processing. However, they don't yet provide mechanisms for multi-omics data integration, where each omics dataset may have overlapping samples or varying row/column sizes.

https://github.com/pyOpenSci/software-review/issues/30

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

Publication options

JOSS Checks

- [x] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [x] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor 'utility' packages, including 'thin' API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [x] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`.
- [x] The package is deposited in a long-term repository with the DOI: 10.5281/zenodo.4441167

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PRs, rather than submitting a denser text-based review. It will also allow you to demonstrate addressing the issues via PR links.

Code of conduct

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

lwasser commented 3 years ago

welcome to pyOpenSci @JonnyTran! someone will follow up with you in the next week or so. Our progress is slow right now as we're working on funding for this effort!! @NickleDave was this one you wanted to work on? you are doing a lot so please let me know if you have time (or if you don't, that is understandable too!)

NickleDave commented 3 years ago

@JonnyTran this looks perfect, thank you. Sorry for not replying when you opened this issue.

Hey @lwasser yes I have this on my to-do list and will start looking for reviewers Wednesday

lwasser commented 3 years ago

wonderful! @NickleDave thank you!

NickleDave commented 3 years ago

Hi all, just adding editor checks.

Thank you @JonnyTran for your detailed submission

Editor checks:

submitter did not check "yes" for submitting to JOSS -- edit 2021-01-22: reviewer will submit to JOSS. Has added DOI to repository in response to my initial comments


Editor comments

Overall:

Minor comments: things I notice on first glance


Reviewers: @gawbul @ksielemann Due date: February 15, 2021

NickleDave commented 3 years ago

started reaching out to reviewers, will update as soon as we hear back

JonnyTran commented 3 years ago

Minor comments:

  • no DOI for releases, e.g. using Zenodo integration. Good idea to have DOI to make releases citable, e.g. in papers to make explicit which version was used
  • no explicit link to docs page

Thanks so much for the great suggestions, @NickleDave. I've addressed the "releases DOI" and "docs link" issues above, and I'm planning on updating the README file to be more attractive to potential contributors.

JonnyTran commented 3 years ago

Hi @NickleDave and @lwasser, will it be possible for OpenOmics to make submission to JOSS at this stage of the review? I just changed my mind about this, but I already have a complete manuscript and I'm ready to provide the paper.md within a few days.

NickleDave commented 3 years ago

Hey @JonnyTran I think that could be okay. But first let me make sure I understand what you're asking.

Hi @NickleDave and @lwasser, will it be possible for OpenOmics to make submission to JOSS at this stage of the review?

Do you mean that you want to change "Publication Options" in your submission above, and check the box for "automatically submit to" JOSS? If so, yes I think that's fine. @lwasser please confirm

I just changed my mind about this, but I already have a complete manuscript and I'm ready to provide the paper.md within a few days.

Do you mean that you have a complete, separate manuscript written about OpenOmics, in addition to the paper.md you would submit to JOSS? E.g., like Physcraper https://github.com/pyOpenSci/software-review/issues/26 which has a paper on biorxiv https://www.biorxiv.org/content/10.1101/2020.09.15.299156v1

If that's the case, we should discuss more whether you want to submit to JOSS. We can perhaps tag some editors and ask if they can give us input. My impression is that JOSS is usually meant to provide a mechanism for getting publication credit for software in cases where the developer/maintainer can't easily publish a paper about it. Although I think they might have changed some of the language in their submission guidelines about this. See for example: https://joss.readthedocs.io/en/latest/submitting.html#co-publication-of-science-methods-and-software

so: if you just want to go through pyOpenSci review and then submit to JOSS at the end, yes, totally fine, assuming @lwasser agrees with me. Anything else, we should probably discuss a little more first

NickleDave commented 3 years ago

Hi @JonnyTran just want to follow up on this -- please let me know what you're thinking

I think I do have one potential reviewer and can move ahead with finding another whenever you're ready

JonnyTran commented 3 years ago

Hi @NickleDave, sorry it took a while to consult with my advisor. Yes, I meant that I'd like to check the box on automatic submission, and no, I have not submitted a separate manuscript elsewhere. My intention for the JOSS submission is to publish the technical software contributions I have so far. In later months (after finishing the web-app features in openomics), I do plan on making another contribution to a bioinformatics journal on the scientific bioinformatics use cases. So I think this would fall under "co-publication".

Thank you for the clarifications!

NickleDave commented 3 years ago

great, glad to hear it @JonnyTran
and I totally understand needing to consult with your advisor--didn't mean to rush you, just don't want this to fall off my to-do list

please do go ahead and edit your initial comment to check that automatic submission box, and make sure you address the to-do list in that section

I will continue with my part of the review process

NickleDave commented 3 years ago

Hi again @JonnyTran excited to let you know that @gawbul and @ksielemann have both kindly accepted our invitation to review

@gawbul actually started PyOpenSci back in 2013 (see this ROpenSci blogpost) and develops related tools such as pyEnsemblREST

@ksielemann has significant experience with omics datasets and was recommended to us by @bpucker as developer of the tool QUOD (from their publication https://www.biorxiv.org/content/10.1101/2020.04.28.065714v1.abstract)

@gawbul and @ksielemann here are related links again for your convenience: Our reviewers guide details what we look for in a package review, and includes links to sample reviews. Our standards are detailed in our packaging guide, and we provide a reviewer template for you to use. Please make sure you do not have a conflict of interest preventing you from reviewing this package. If you have questions or feedback, feel free to ask me here or by email, or post to the pyOpenSci forum.

I will update my editor checks above to add you both as reviewers, and set an initial due date of three weeks: February 15, 2021

lwasser commented 3 years ago

this is so awesome!!

@gawbul hello again!! so great to see you here. We are moving forward with PyOS through the Sloan foundation (fingers crossed) as we briefly discussed forever ago. I'd love to see you continue to participate when you have time in whatever capacity you have time for!! :)

thank you all for this review!

ksielemann commented 3 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

It is not completely clear what OpenOmics can be used for. An overview of the available functions and methods would be great: What exactly does OpenOmics do? What are specific usage examples (e.g., after using OpenOmics: what's next?)?

'OpenOmics facilitates the common coding tasks when preparing data for bioinformatics analysis.': For which bioinformatic analyses exactly?

'# Load each expression dataframe': the additional ')' in these lines should be removed, as the current form results in an error.

mRNA = MessengerRNA(data=folder_path+"LUAD__geneExp.txt", transpose=True, usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="gene_name") results in warnings (only on first use): '/homes/.local/lib/python3.7/site-packages/openomics/transcriptomics.py:95: FutureWarning: read_table is deprecated, use read_csv instead. /homes/.local/lib/python3.7/site-packages/openomics/transcriptomics.py:95: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support sep=None with delim_whitespace=False; you can avoid this warning by specifying engine='python'.'
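For reference, a minimal pandas sketch of the fix that warning suggests (the table contents here are hypothetical): `read_csv` replaces the deprecated `read_table`, and `engine="python"` lets `sep=None` sniff the delimiter without the ParserWarning.

```python
import io
import pandas as pd

# A toy tab-delimited expression table (hypothetical data).
text = "GeneSymbol\tS1\tS2\nBRCA1\t1.2\t3.4\n"

# read_csv with engine="python" supports sep=None delimiter sniffing,
# avoiding both the FutureWarning and the ParserWarning above.
df = pd.read_csv(io.StringIO(text), sep=None, engine="python")
```

Passing an explicit `sep="\t"` would work equally well when the delimiter is known in advance, and keeps the faster C engine available.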

som = SomaticMutation(data=folder_path+"LUAD__somaticMutation_geneLevel.txt", transpose=True, usecols="GeneSymbol|TCGA", gene_index="gene_name") results in: 'KeyError: 'gene_name''. This should probably be 'gene_index="GeneSymbol"'.

luad_data.add_clinical_data(clinical_data=folder_path+"nationwidechildrens.org_clinical_patient_luad.txt") results in warning: '/homes/.local/lib/python3.7/site-packages/openomics/clinical.py:51: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.'

gencode = GENCODE(path="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/", file_resources={"long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz", "basic.annotation.gtf": "gencode.v32.basic.annotation.gtf.gz", "lncRNA_transcripts.fa": "gencode.v32.lncRNA_transcripts.fa.gz", "transcripts.fa": "gencode.v32.transcripts.fa.gz"}, remove_version_num=True, npartitions=5) results in: 'AttributeError: 'io.TextIOWrapper' object has no attribute 'startswith''.

Please see above (in 'A statement of need').

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Please see above (in 'A statement of need'). The goals could be communicated more specifically.

Citation information is missing at the end of the README.

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider:

Please see above (in 'A statement of need'). Specific examples for bioinformatic analyses after the use of OpenOmics could be added.

Please see above (in 'A statement of need'). An overview of all methods and functions of the package would be helpful.

Functionality

Installation with pip install openomics worked fine on the Linux system I am using. However, using my windows computer, I got the following error: error: Microsoft Visual C++ 14.0 is required

A list of dependencies/requirements in the README would be great.

from openomics import MultiOmics results in the "UserWarning: Tensorflow not installed; ParametricUMAP will be unavailable"
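One common way to keep such an optional dependency soft is an import guard; the sketch below is a hypothetical helper, not the current openomics code, showing how a missing TensorFlow can degrade to a warning instead of surfacing at import time.

```python
import importlib
import warnings

def optional_import(name):
    """Return the module if installed, else warn and return None."""
    try:
        return importlib.import_module(name)
    except ImportError:
        warnings.warn(f"{name} not installed; related features will be unavailable")
        return None

# Features depending on `tf` can then check `if tf is not None:` before use.
tf = optional_import("tensorflow")
```

This pattern keeps heavyweight extras (TensorFlow, here) out of the hard requirements while still telling the user exactly which features are unavailable.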

Please see above.

Please see above.

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 5


Review Comments

The concept of OpenOmics seems interesting and useful for the integration of various omics datasets. Overall, the package has a clear documentation. However, there are still a few issues that should be addressed. Please see the points above and below.

# fetch data
import os
import urllib.request

def fetch_data(file_url, own_path, file_name):
    if not os.path.isdir(own_path):
        os.makedirs(own_path)
    own_file_path = os.path.join(own_path, file_name)
    urllib.request.urlretrieve(file_url, own_file_path)

FILE_NAMES = ["LUAD__geneExp.txt",
              "LUAD__miRNAExp__RPM.txt",
              "LUAD__protein_RPPA.txt",
              "LUAD__somaticMutation_geneLevel.txt",
              "TCGA-rnaexpr.tsv",
              "genome.wustl.edu_biospecimen_sample_luad.txt",
              "nationwidechildrens.org_clinical_drug_luad.txt",
              "nationwidechildrens.org_clinical_patient_luad.txt",
              "protein_RPPA.txt"]

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/BioMeCIS-Lab/OpenOmics/master/tests/data/TCGA_LUAD/"
OWN_PATH = os.path.join("data", "omics")

for file_name in FILE_NAMES:
    FILE_URL = DOWNLOAD_ROOT + file_name
    fetch_data(FILE_URL, OWN_PATH, file_name)

gtex = GTEx(path="https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/") Results in: OSError: Not enough free space in /homes/.astropy/cache/download/url to download a 3.6G file, only 2.7G left.

Is there a possibility to choose the directory in which the files should be downloaded? This would be great.

JonnyTran commented 3 years ago

Thanks so much for the fantastic review, @ksielemann! I can work on the revisions, which should be available in 2 weeks.

Download of test data. Add code to download the data within a Python script so that the user does not have to download the whole repository or describe exactly how to download the test data.

Thanks for pointing this out and providing the automated script. I initially placed the test data at tests/data/ for the automated tests that run with pytest ./ at the root directory. You can actually copy the code in tests/test_multiomics.py to load the -omics data.

gtex = GTEx(path="https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/") Results in: OSError: Not enough free space in /homes/.astropy/cache/download/url to download a 3.6G file, only 2.7G left.

I used the astropy package to automatically cache downloaded files. It defaults to saving files at /homes/.astropy/cache/, and ideally the cache should be in one location per user session. But I can see how useful it would be for the user to choose a directory of their choice - I will look into adding an openomics configuration file in the user's home directory, located at ~/.openomics/conf.json. I've made an issue at https://github.com/BioMeCIS-Lab/OpenOmics/issues/112
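For what it's worth, a minimal sketch of how such a conf.json lookup could work (the helper name and the "cache_dir" key are hypothetical, matching the proposal above rather than any shipped openomics API):

```python
import json
from pathlib import Path

def get_cache_dir(conf_path=Path.home() / ".openomics" / "conf.json"):
    """Return the user-configured cache directory, or a default."""
    default = Path.home() / ".openomics" / "cache"
    conf_path = Path(conf_path)
    if conf_path.exists():
        conf = json.loads(conf_path.read_text())
        return Path(conf.get("cache_dir", default))
    return default
```

Downloads would then be written under `get_cache_dir()`, so a user short on space in their home partition could point "cache_dir" at any volume with room for multi-gigabyte files.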

NickleDave commented 3 years ago

Just echoing @JonnyTran -- yes thank you for getting this detailed review back so quickly @ksielemann

Looks great to me. I will read in detail this weekend just to make sure I'm staying up to date with the review process

gawbul commented 3 years ago

Just working through my review at present. Should have it done by the end of the day. Apologies for the delay.

gawbul commented 3 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

So the package states it is solving the problem of integrating various multi-omics datasets, though this is fairly broad, perhaps because the intention is to widen the project's scope in the future. It isn't clear what datasets and data formats are supported, however. Perhaps this isn't relevant for the README, but I feel it could be included in the linked documentation on Read the Docs. The target audience is implicitly defined, and my assumption is that the package would primarily be used by bioinformaticians, though perhaps this could be more explicit?

Installation instructions seemed clear and I attempted to install via pip; however, I initially received the following error:

Installing collected packages: MarkupSafe, Werkzeug, numpy, Jinja2, itsdangerous, zope.interface, zope.event, urllib3, threadpoolctl, scipy, retrying, PyYAML, pytz, python-dateutil, pyparsing, ptyprocess, llvmlite, joblib, idna, heapdict, greenlet, Flask, chardet, certifi, brotli, zict, typing-extensions, tornado, toolz, tblib, soupsieve, sortedcontainers, scikit-learn, requests, psutil, plotly, pillow, pexpect, patsy, pandas, packaging, numba, msgpack, locket, gevent, future, flask-compress, decorator, dask, dash-table, dash-renderer, dash-html-components, dash-core-components, colorlog, cloudpickle, xmltodict, xlsxwriter, xlrd, wrapt, wget, suds-jurko, statsmodels, requests-cache, pynndescent, pyerfa, pydot, partd, networkx, lxml, kiwisolver, grequests, fsspec, easydev, docopt, distributed, dash, cython, cycler, cachetools, bokeh, beautifulsoup4, appdirs, validators, umap-learn, typing, sqlalchemy, scikit-allel, rarfile, obonet, matplotlib, large-image, h5py, gunicorn, gtfparse, goatools, filetype, dash-daq, dash-bootstrap-components, bioservices, biopython, astropy, openomics
    Running setup.py install for retrying ... done
    Running setup.py install for llvmlite ... error
    ERROR: Command errored out with exit status 1:
     command: /Users/stephenmoss/.pyenv/versions/3.9.0/bin/python3.9 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/bf/cl_g6mhx7zd9_1mhhvppbzdh0000gn/T/pip-install-sm6pl876/llvmlite_0bff606e31a6496399a22ccfcec04d59/setup.py'"'"'; __file__='"'"'/private/var/folders/bf/cl_g6mhx7zd9_1mhhvppbzdh0000gn/T/pip-install-sm6pl876/llvmlite_0bff606e31a6496399a22ccfcec04d59/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/bf/cl_g6mhx7zd9_1mhhvppbzdh0000gn/T/pip-record-rgduko9u/install-record.txt --single-version-externally-managed --compile --install-headers /Users/stephenmoss/.pyenv/versions/3.9.0/include/python3.9/llvmlite
         cwd: /private/var/folders/bf/cl_g6mhx7zd9_1mhhvppbzdh0000gn/T/pip-install-sm6pl876/llvmlite_0bff606e31a6496399a22ccfcec04d59/
    Complete output (29 lines):
    running install
    running build
    got version from file /private/var/folders/bf/cl_g6mhx7zd9_1mhhvppbzdh0000gn/T/pip-install-sm6pl876/llvmlite_0bff606e31a6496399a22ccfcec04d59/llvmlite/_version.py {'version': '0.34.0', 'full': 'c5889c9e98c6b19d5d85ebdd982d64a03931f8e2'}
    running build_ext
    /Users/stephenmoss/.pyenv/versions/3.9.0/bin/python3.9 /private/var/folders/bf/cl_g6mhx7zd9_1mhhvppbzdh0000gn/T/pip-install-sm6pl876/llvmlite_0bff606e31a6496399a22ccfcec04d59/ffi/build.py
    LLVM version... Traceback (most recent call last):
      File "/private/var/folders/bf/cl_g6mhx7zd9_1mhhvppbzdh0000gn/T/pip-install-sm6pl876/llvmlite_0bff606e31a6496399a22ccfcec04d59/ffi/build.py", line 105, in main_posix
        out = subprocess.check_output([llvm_config, '--version'])
      File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/subprocess.py", line 420, in check_output
        return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
      File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/subprocess.py", line 501, in run
        with Popen(*popenargs, **kwargs) as process:
      File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/subprocess.py", line 947, in __init__
        self._execute_child(args, executable, preexec_fn, close_fds,
      File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/subprocess.py", line 1819, in _execute_child
        raise child_exception_type(errno_num, err_msg, err_filename)
    FileNotFoundError: [Errno 2] No such file or directory: 'llvm-config'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/private/var/folders/bf/cl_g6mhx7zd9_1mhhvppbzdh0000gn/T/pip-install-sm6pl876/llvmlite_0bff606e31a6496399a22ccfcec04d59/ffi/build.py", line 191, in <module>
        main()
      File "/private/var/folders/bf/cl_g6mhx7zd9_1mhhvppbzdh0000gn/T/pip-install-sm6pl876/llvmlite_0bff606e31a6496399a22ccfcec04d59/ffi/build.py", line 185, in main
        main_posix('osx', '.dylib')
      File "/private/var/folders/bf/cl_g6mhx7zd9_1mhhvppbzdh0000gn/T/pip-install-sm6pl876/llvmlite_0bff606e31a6496399a22ccfcec04d59/ffi/build.py", line 107, in main_posix
        raise RuntimeError("%s failed executing, please point LLVM_CONFIG "
    RuntimeError: llvm-config failed executing, please point LLVM_CONFIG to the path for llvm-config
    error: command '/Users/stephenmoss/.pyenv/versions/3.9.0/bin/python3.9' failed with exit code 1
    ----------------------------------------
ERROR: Command errored out with exit status 1: /Users/stephenmoss/.pyenv/versions/3.9.0/bin/python3.9 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/bf/cl_g6mhx7zd9_1mhhvppbzdh0000gn/T/pip-install-sm6pl876/llvmlite_0bff606e31a6496399a22ccfcec04d59/setup.py'"'"'; __file__='"'"'/private/var/folders/bf/cl_g6mhx7zd9_1mhhvppbzdh0000gn/T/pip-install-sm6pl876/llvmlite_0bff606e31a6496399a22ccfcec04d59/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/bf/cl_g6mhx7zd9_1mhhvppbzdh0000gn/T/pip-record-rgduko9u/install-record.txt --single-version-externally-managed --compile --install-headers /Users/stephenmoss/.pyenv/versions/3.9.0/include/python3.9/llvmlite Check the logs for full command output.

I needed to run the following to fix the issue:

brew install llvm@9
LLVM_CONFIG=/usr/local/opt/llvm@9/bin/llvm-config pip install openomics

_I tried with brew install llvm (version 11.0.1) and it failed with RuntimeError: Building llvmlite requires LLVM 10.0.x or 9.0.x, got '11.0.1'. Be sure to set LLVM_CONFIG to the right executable path._

Perhaps an external dependency on LLVM can be specified (it is required by llvmlite)? The assumption is that the end user has a working Python installation and the relevant compilers etc. installed, though this doesn't seem to be specified anywhere?

When trying to run an openomics_test.py file with the from openomics import MultiOmics statement I received the following:

Creating directory /Users/stephenmoss/Library/Application Support/bioservices
Matplotlib is building the font cache; this may take a moment.
Traceback (most recent call last):
  File "/Users/stephenmoss/Dropbox/Code/openomics_test.py", line 1, in <module>
    from openomics import MultiOmics
  File "/Users/stephenmoss/Dropbox/Code/OpenOmics/openomics/__init__.py", line 40, in <module>
    from .visualization import (
  File "/Users/stephenmoss/Dropbox/Code/OpenOmics/openomics/visualization/umap.py", line 3, in <module>
    import umap
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/umap/__init__.py", line 2, in <module>
    from .umap_ import UMAP
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/umap/umap_.py", line 47, in <module>
    from pynndescent import NNDescent
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/pynndescent/__init__.py", line 3, in <module>
    from .pynndescent_ import NNDescent, PyNNDescentTransformer
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/pynndescent/pynndescent_.py", line 21, in <module>
    import pynndescent.sparse as sparse
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/pynndescent/sparse.py", line 330, in <module>
    def sparse_alternative_jaccard(ind1, data1, ind2, data2):
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/decorators.py", line 218, in wrapper
    disp.compile(sig)
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/compiler_lock.py", line 32, in _acquire_compile_lock
    return func(*args, **kwargs)
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/dispatcher.py", line 819, in compile
    cres = self._compiler.compile(args, return_type)
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/dispatcher.py", line 82, in compile
    raise retval
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/dispatcher.py", line 92, in _compile_cached
    retval = self._compile_core(args, return_type)
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/dispatcher.py", line 105, in _compile_core
    cres = compiler.compile_extra(self.targetdescr.typing_context,
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/compiler.py", line 627, in compile_extra
    return pipeline.compile_extra(func)
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/compiler.py", line 363, in compile_extra
    return self._compile_bytecode()
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/compiler.py", line 425, in _compile_bytecode
    return self._compile_core()
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/compiler.py", line 405, in _compile_core
    raise e
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/compiler.py", line 396, in _compile_core
    pm.run(self.state)
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/compiler_machinery.py", line 341, in run
    raise patched_exception
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/compiler_machinery.py", line 332, in run
    self._runPass(idx, pass_inst, state)
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/compiler_lock.py", line 32, in _acquire_compile_lock
    return func(*args, **kwargs)
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/compiler_machinery.py", line 291, in _runPass
    mutated |= check(pss.run_pass, internal_state)
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/compiler_machinery.py", line 264, in check
    mangled = func(compiler_state)
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/typed_passes.py", line 92, in run_pass
    typemap, return_type, calltypes = type_inference_stage(
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/typed_passes.py", line 70, in type_inference_stage
    infer.propagate(raise_errors=raise_errors)
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/typeinfer.py", line 1071, in propagate
    raise errors[0]
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Failed in nopython mode pipeline (step: nopython frontend)
Failed in nopython mode pipeline (step: nopython mode backend)
Failed in nopython mode pipeline (step: nopython mode backend)
Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<function make_quicksort_impl.<locals>.run_quicksort at 0x13e3a6940>) found for signature:

 >>> run_quicksort(array(int32, 1d, C))

There are 2 candidate implementations:
  - Of which 2 did not match due to:
  Overload in function 'register_jitable.<locals>.wrap.<locals>.ov_wrap': File: numba/core/extending.py: Line 150.
    With argument(s): '(array(int32, 1d, C))':
   Rejected as the implementation raised a specific error:
     UnsupportedError: Failed in nopython mode pipeline (step: analyzing bytecode)
   Use of unsupported opcode (LOAD_ASSERTION_ERROR) found

   File "../../../.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/misc/quicksort.py", line 180:
       def run_quicksort(A):
           <source elided>
               while high - low >= SMALL_QUICKSORT:
                   assert n < MAX_STACK
                   ^

  raised from /Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/byteflow.py:269

During: resolving callee type: Function(<function make_quicksort_impl.<locals>.run_quicksort at 0x13e3a6940>)
During: typing of call at /Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/np/arrayobj.py (5007)

File "../../../.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/np/arrayobj.py", line 5007:
    def array_sort_impl(arr):
        <source elided>
        # Note we clobber the return value
        sort_func(arr)
        ^

During: lowering "$14call_method.5 = call $12load_method.4(func=$12load_method.4, args=[], kws=(), vararg=None)" at /Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/np/arrayobj.py (5017)
During: lowering "$8call_method.3 = call $4load_method.1(arr, func=$4load_method.1, args=[Var(arr, sparse.py:28)], kws=(), vararg=None)" at /Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/pynndescent/sparse.py (28)
During: resolving callee type: type(CPUDispatcher(<function arr_unique at 0x13e1d4280>))
During: typing of call at /Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/pynndescent/sparse.py (41)

During: resolving callee type: type(CPUDispatcher(<function arr_unique at 0x13e1d4280>))
During: typing of call at /Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/pynndescent/sparse.py (41)

File "../../../.pyenv/versions/3.9.0/lib/python3.9/site-packages/pynndescent/sparse.py", line 41:
def arr_union(ar1, ar2):
    <source elided>
    else:
        return arr_unique(np.concatenate((ar1, ar2)))
        ^

During: resolving callee type: type(CPUDispatcher(<function arr_union at 0x13e1d4820>))
During: typing of call at /Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/pynndescent/sparse.py (331)

During: resolving callee type: type(CPUDispatcher(<function arr_union at 0x13e1d4820>))
During: typing of call at /Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/pynndescent/sparse.py (331)

File "../../../.pyenv/versions/3.9.0/lib/python3.9/site-packages/pynndescent/sparse.py", line 331:
def sparse_alternative_jaccard(ind1, data1, ind2, data2):
    num_non_zero = arr_union(ind1, ind2).shape[0]
    ^

This turned out to be an issue with Python 3.9, which the package is supposed to support. I tried with Python 3.8 instead, but the scipy install initially failed because it needed the BLAS and LAPACK libraries. I needed to install using:

brew install openblas lapack
LDFLAGS="-L/usr/local/opt/openblas/lib -L/usr/local/opt/lapack/lib" CPPFLAGS="-I/usr/local/opt/openblas/include -I/usr/local/opt/lapack/include" LLVM_CONFIG=/usr/local/opt/llvm@9/bin/llvm-config pip install openomics

Now running python openomics_test.py gives me only the following warning:

/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/umap/__init__.py:9: UserWarning: Tensorflow not installed; ParametricUMAP will be unavailable
  warn("Tensorflow not installed; ParametricUMAP will be unavailable")

Running the following rectifies this:

brew install libtensorflow
pip install tensorflow

I feel, therefore, that a dependency on Python 3.8 should be specified in the documentation and setup.py, as there appear to be issues with Python 3.9 at present. It would also be useful to include tensorflow in the list of package dependencies (i.e. requirements.txt) to avoid this warning. Using something like pipenv might be an ideal solution here, though explicitly stating the external library dependencies for scipy would still be necessary.
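Beyond the declarative `python_requires=">=3.6,<3.9"` line one could add to setup.py (the lower bound here is my assumption, not confirmed by the package), a runtime check could also surface a clear message instead of the opaque numba failure. A minimal sketch — the helper name and version range are mine, not part of openomics:

```python
import sys

def check_supported_python(version_info=None):
    """Return True if the interpreter is in the (assumed) supported 3.6-3.8 range.

    Hypothetical helper: openomics could call this at import time and warn
    clearly, rather than letting numba fail deep inside the pipeline on 3.9.
    """
    vi = version_info if version_info is not None else sys.version_info
    return (3, 6) <= (vi[0], vi[1]) < (3, 9)

if __name__ == "__main__":
    if not check_supported_python():
        print("Warning: openomics is currently only tested on Python 3.6-3.8")
```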

When extending openomics_test.py to include the example for Load the multiomics: Gene Expression, MicroRNA expression, lncRNA expression, Copy Number Variation, Somatic Mutation, DNA Methylation, and Protein Expression data, I get the following error running the first sample:

File "openomics_test.py", line 11
  usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="transcript_name")
  ^
IndentationError: unexpected indent

This is because there are closing parentheses where there shouldn't be.

The example in the README should be the following (I have submitted a PR for this):

# Load each expression dataframe
mRNA = MessengerRNA(data=folder_path+"LUAD__geneExp.txt",
        transpose=True, usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="gene_name")
miRNA = MicroRNA(data=folder_path+"LUAD__miRNAExp__RPM.txt",
        transpose=True, usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="transcript_name")
lncRNA = LncRNA(data=folder_path+"TCGA-rnaexpr.tsv",
        transpose=True, usecols="Gene_ID|TCGA", gene_index="Gene_ID", gene_level="gene_id")
som = SomaticMutation(data=folder_path+"LUAD__somaticMutation_geneLevel.txt",
        transpose=True, usecols="GeneSymbol|TCGA", gene_index="gene_name")
pro = Protein(data=folder_path+"protein_RPPA.txt",
        transpose=True, usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="protein_name")

Running openomics_test.py now gives me the following warning for the MessengerRNA function:

/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/transcriptomics.py:95: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support sep=None with delim_whitespace=False; you can avoid this warning by specifying engine='python'.

I believe this can be rectified by updating transcriptomics.py here with:

df = pd.read_table(data, sep=None, engine='python')
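To sanity-check that one-liner in isolation: with `sep=None`, pandas has to sniff the delimiter, which only the `'python'` engine supports without falling back. A self-contained run (the column names and values here are illustrative, not from the TCGA dataset):

```python
import io

import pandas as pd

# sep=None triggers delimiter sniffing; engine="python" avoids the
# ParserWarning about falling back from the 'c' engine.
data = io.StringIO("GeneSymbol\tTCGA-01\tTCGA-02\nTP53\t1.5\t2.0\n")
df = pd.read_table(data, sep=None, engine="python")
print(df.shape)  # one gene row, three columns
```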

The SomaticMutation function gives me the following error:

Traceback (most recent call last):
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'gene_name'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "openomics_test.py", line 14, in <module>
    som = SomaticMutation(data=folder_path+"LUAD__somaticMutation_geneLevel.txt",
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/genomics.py", line 22, in __init__
    super(SomaticMutation, self).__init__(data=data, transpose=transpose, gene_index=gene_index, usecols=usecols,
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/transcriptomics.py", line 50, in __init__
    self.expressions = self.preprocess_table(df, usecols=usecols, gene_index=gene_index, transposed=transpose,
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/transcriptomics.py", line 148, in preprocess_table
    df = df[df[gene_index] != '?']
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/pandas/core/frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: 'gene_name'

Using GeneSymbol instead of gene_name for the gene_index parameter in the vignette fixes this, e.g.:

som = SomaticMutation(data=folder_path+"LUAD__somaticMutation_geneLevel.txt",
        transpose=True, usecols="GeneSymbol|TCGA", gene_index="GeneSymbol")

Running openomics_test.py now gives me the following:

/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/transcriptomics.py:95: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support sep=None with delim_whitespace=False; you can avoid this warning by specifying engine='python'.
  df = pd.read_table(data, sep=None)
MessengerRNA (576, 20472) , indexed by: gene_name
MicroRNA (494, 1870) , indexed by: transcript_name
LncRNA (546, 12727) , indexed by: gene_id
SomaticMutation (889, 21070) , indexed by: GeneSymbol
Protein (364, 200) , indexed by: protein_name

This differs from the output in the README, however, which is:

PATIENTS (522, 5)
SAMPLES (1160, 6)
DRUGS (461, 4)
MessengerRNA (576, 20472)
SomaticMutation (587, 21070)
MicroRNA (494, 1870)
LncRNA (546, 12727)
Protein (364, 154)

Running the example under Annotate LncRNAs with GENCODE genomic annotations returns the following:

Downloading ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.long_noncoding_RNAs.gtf.gz
|========================================================================================================================================================================================================| 4.4M/4.4M (100.00%)         0s
Downloading ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.basic.annotation.gtf.gz
|========================================================================================================================================================================================================|  26M/ 26M (100.00%)         7s
Downloading ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.lncRNA_transcripts.fa.gz
|========================================================================================================================================================================================================|  14M/ 14M (100.00%)         3s
Downloading ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.transcripts.fa.gz
|========================================================================================================================================================================================================|  72M/ 72M (100.00%)        15s
INFO:root:<_io.TextIOWrapper name='/Users/stephenmoss/.astropy/cache/download/url/141581d04d4001254d07601dfa7d983b/contents' encoding='UTF-8'>
Traceback (most recent call last):
  File "openomics_test.py", line 34, in <module>
    gencode = GENCODE(path="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/",
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/database/sequence.py", line 67, in __init__
    super(GENCODE, self).__init__(path=path, file_resources=file_resources, col_rename=col_rename,
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/database/sequence.py", line 17, in __init__
    super(SequenceDataset, self).__init__(**kwargs)
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/database/base.py", line 39, in __init__
    self.data = self.load_dataframe(file_resources, npartitions=npartitions)
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/database/sequence.py", line 74, in load_dataframe
    df = read_gtf(file_resources[gtf_file], npartitions=npartitions)
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/utils/read_gtf.py", line 349, in read_gtf
    result_df = parse_gtf_and_expand_attributes(
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/utils/read_gtf.py", line 290, in parse_gtf_and_expand_attributes
    result = parse_gtf(
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/utils/read_gtf.py", line 195, in parse_gtf
    chunk_iterator = dd.read_table(
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/dask/dataframe/io/csv.py", line 659, in read
    return read_pandas(
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/dask/dataframe/io/csv.py", line 464, in read_pandas
    paths = get_fs_token_paths(urlpath, mode="rb", storage_options=storage_options)[
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/fsspec/core.py", line 619, in get_fs_token_paths
    path = cls._strip_protocol(urlpath)
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/fsspec/implementations/local.py", line 147, in _strip_protocol
    if path.startswith("file://"):
AttributeError: '_io.TextIOWrapper' object has no attribute 'startswith'

I initially submitted an issue with fsspec here, but it turned out to be caused by the gzip library wrapping the return value in an io.TextIOWrapper object. fsspec cannot determine a path for that object and therefore returns it unchanged, which then fails downstream. See the issue for more details, but we need to return a legitimate file object with a resolvable path, rather than an io.TextIOWrapper, so that the failure in fsspec does not occur.

I was unable to continue with the rest of the vignettes as a result despite attempting some fixes locally.

Docstrings seem to be available throughout the codebase for all relevant functions, however, I found the documentation on Read the Docs to be lacking in comparison, particularly in providing detail of function parameters that are otherwise available in the docstrings.

Looking at the various classes and respective functions throughout the codebase, there doesn't seem to be full coverage of all functionality in the examples provided in the documentation. This would be too much for the README, I feel, but could certainly be represented on Read the Docs.

There is no link to the contribution guidelines from the main README, which I feel would be beneficial; however, contributing guidelines are available in https://github.com/BioMeCIS-Lab/OpenOmics/blob/master/CONTRIBUTING.rst and on Read the Docs.

Just being finicky, I would personally prefer a common format for the README and CONTRIBUTING documents etc.: either both in reStructuredText or both in Markdown. There seems to be an outdated README.rst that could probably be replaced with the current README.md?

Readme requirements

The package meets the readme requirements below:

The README should include, from top to bottom:

I feel some things are missing regarding setup information for certain platforms (i.e. I had issues on macOS). Use of pipenv may be beneficial for certain dependencies, i.e. those that generated warnings. Reproducibility across platforms is always an issue, though perhaps a Docker image with all required dependencies could be made available for people to run their scripts?

No comparisons are made to other packages. I feel it is similar in some ways to PyCogent, though this is no longer actively maintained, and perhaps even comparable in some ways to QIIME2?

No citation information is provided in the README, though perhaps this can be made available if submitted to JOSS. It would also be great to see a DOI made available via GitHub's integration with Zenodo as mentioned by @NickleDave.

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider:

One minor PR submitted as described above. I'll try see if I can spare some time to look into the gzip issue at some point too.

Though could be extended as discussed above.

This is relatively clear, but could be expanded on as above.

As discussed above, there appear to be docstrings throughout the code, but this doesn't seem to be reflected in the README or online documentation, particularly for function parameters.

Functionality

I had several issues with installation as described above, though I only tested on a macOS system running Big Sur version 11.2. It appears there are issues with Python 3.9.x in particular, though I can't confirm whether this is platform specific. There were some failures in building scipy on Python 3.8.7 due to dependent libraries missing on my machine.

The package goes a good way towards meeting the claims it reports to be developed for, though the difficulties in getting the examples to run means I was limited in my ability to fully assess them.

No performance claims were provided. On my 16" MacBook Pro with 2.3 GHz Octa-core Intel Core i9 and 32 GB RAM, however, I felt the package was a little slow in loading its dependencies on first run. Tests also took some time to complete.

Some tests are available and run as part of the Travis CI pipeline, though coverage isn't great and would benefit from additional work. Focusing on test-driven development is a good way to ensure greater coverage. Running the tests locally took a long time and returned various warnings and an error.

☁ OpenOmics [master] python -m pytest --cov=./
============================================================================================================ test session starts =============================================================================================================
platform darwin -- Python 3.8.7, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
rootdir: /Users/stephenmoss/Dropbox/Code/OpenOmics, configfile: setup.cfg
plugins: cov-2.11.1, dash-1.19.0
collected 35 items

tests/test_annotations.py .........                                                                                                                                                                                                    [ 25%]
tests/test_disease.py ..........                                                                               [ 54%]
tests/test_interaction.py .E.....                                                                              [ 74%]
tests/test_multiomics.py ...                                                                                   [ 82%]
tests/test_sequences.py ......                                                                                 [100%]

======================================================= ERRORS =======================================================
______________________________________ ERROR at setup of test_import_MiRTarBase ______________________________________

    @pytest.fixture
    def generate_MiRTarBase():
>       return MiRTarBase(path="/data/datasets/Bioinformatics_ExternalData/miRTarBase/", strip_mirna_name=True,
                          filters={"Species (Target Gene)": "Homo sapiens"})

tests/test_interaction.py:19:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
openomics/database/interaction.py:611: in __init__
    super(MiRTarBase, self).__init__(path=path, file_resources=file_resources,
openomics/database/interaction.py:40: in __init__
    self.validate_file_resources(path, file_resources, verbose=verbose)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <openomics.database.interaction.MiRTarBase object at 0x1aa5036d0>
path = '/data/datasets/Bioinformatics_ExternalData/miRTarBase/'
file_resources = {'miRTarBase_MTI.xlsx': '/data/datasets/Bioinformatics_ExternalData/miRTarBase/miRTarBase_MTI.xlsx'}
npartitions = None, verbose = False

    def validate_file_resources(self, path, file_resources, npartitions=None, verbose=False) -> None:
        """For each file in file_resources, fetch the file if path+file is a URL
        or load from disk if a local path. Additionally unzip or unrar if the
        file is compressed.

        Args:
            path (str): The folder or url path containing the data file
                resources. If url path, the files will be downloaded and cached
                to the user's home folder (at ~/.astropy/).
            file_resources (dict): default None, Used to list required files for
                preprocessing of the database. A dictionary where keys are
                required filenames and value are file paths. If None, then the
                class constructor should automatically build the required file
                resources dict.
            npartitions:
            verbose:
        """
        if validators.url(path):
            for filename, filepath in copy.copy(file_resources).items():
                data_file = get_pkg_data_filename(path, filepath,
                                                  verbose=verbose)  # Download file and replace the file_resource path
                filetype_ext = filetype.guess(data_file)

                # This null if-clause is needed incase when filetype_ext is None, causing the next clause to fail
                if filetype_ext is None:
                    file_resources[filename] = data_file

                elif filetype_ext.extension == 'gz':
                    file_resources[filename] = gzip.open(data_file, 'rt')

                elif filetype_ext.extension == 'zip':
                    zf = zipfile.ZipFile(data_file, 'r')

                    for subfile in zf.infolist():
                        if os.path.splitext(subfile.filename)[-1] == os.path.splitext(filename)[-1]: # If the file extension matches
                            file_resources[filename] = zf.open(subfile.filename, mode='r')

                elif filetype_ext.extension == 'rar':
                    rf = rarfile.RarFile(data_file, 'r')

                    for subfile in rf.infolist():
                        if os.path.splitext(subfile.filename)[-1] == os.path.splitext(filename)[-1]: # If the file extension matches
                            file_resources[filename] = rf.open(subfile.filename, mode='r')
                else:
                    file_resources[filename] = data_file

        elif os.path.isdir(path) and os.path.exists(path):
            for _, filepath in file_resources.items():
                if not os.path.exists(filepath):
                    raise IOError(filepath)
        else:
>           raise IOError(path)
E           OSError: /data/datasets/Bioinformatics_ExternalData/miRTarBase/

openomics/database/base.py:113: OSError
================================================== warnings summary ==================================================
../../../.pyenv/versions/3.8.7/lib/python3.8/site-packages/_pytest/config/__init__.py:1233
  /Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/_pytest/config/__init__.py:1233: PytestConfigWarning: Unknown config option: collect_ignore

    self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")

tests/test_annotations.py: 6 warnings
tests/test_disease.py: 6 warnings
tests/test_interaction.py: 4 warnings
tests/test_multiomics.py: 3 warnings
tests/test_sequences.py: 4 warnings
  /Users/stephenmoss/Dropbox/Code/OpenOmics/openomics/transcriptomics.py:108: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support sep=None with delim_whitespace=False; you can avoid this warning by specifying engine='python'.
    df = pd.read_table(data, sep=None)

tests/test_annotations.py::test_import_GTEx
tests/test_annotations.py::test_GTEx_annotate
  /Users/stephenmoss/Dropbox/Code/OpenOmics/openomics/database/annotation.py:239: FutureWarning: The default value of regex will change from True to False in a future version.
    gene_exp_medians["Name"] = gene_exp_medians["Name"].str.replace("[.].*", "")

tests/test_disease.py::test_import_HMDD
tests/test_disease.py::test_annotate_HMDD
  /Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/encodings/unicode_escape.py:26: DeprecationWarning: invalid escape sequence '\ '
    return codecs.unicode_escape_decode(input, self.errors)[0]

tests/test_disease.py::test_import_HMDD
tests/test_disease.py::test_annotate_HMDD
  /Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/encodings/unicode_escape.py:26: DeprecationWarning: invalid escape sequence '\s'
    return codecs.unicode_escape_decode(input, self.errors)[0]

tests/test_interaction.py::test_import_LncRNA2Target
  /Users/stephenmoss/Dropbox/Code/OpenOmics/openomics/database/interaction.py:476: FutureWarning: Your version of xlrd is 1.2.0. In xlrd >= 2.0, only the xls format is supported. As a result, the openpyxl engine will be used if it is installed and the engine argument is not specified. Install openpyxl instead.
    table = pd.read_excel(file_resources["lncRNA_target_from_low_throughput_experiments.xlsx"])

tests/test_interaction.py::test_import_LncRNA2Target
  /Users/stephenmoss/Dropbox/Code/OpenOmics/openomics/database/interaction.py:480: FutureWarning: The default value of regex will change from True to False in a future version.
    table["Target_official_symbol"] = table["Target_official_symbol"].str.replace("(?i)(mir)", "hsa-mir-")

-- Docs: https://docs.pytest.org/en/stable/warnings.html

---------- coverage: platform darwin, python 3.8.7-final-0 -----------
Name                                       Stmts   Miss  Cover
--------------------------------------------------------------
openomics/__init__.py                         25      7    72%
openomics/clinical.py                         57     19    67%
openomics/database/__init__.py                 7      0   100%
openomics/database/annotation.py             197    110    44%
openomics/database/base.py                   140     37    74%
openomics/database/disease.py                 61      1    98%
openomics/database/interaction.py            330    196    41%
openomics/database/ontology.py               152     93    39%
openomics/database/sequence.py               128     66    48%
openomics/genomics.py                         26      8    69%
openomics/imageomics.py                       63     47    25%
openomics/multicohorts.py                      0      0   100%
openomics/multiomics.py                      111     66    41%
openomics/proteomics.py                       13      2    85%
openomics/transcriptomics.py                 111     32    71%
openomics/utils/GTF.py                        53     53     0%
openomics/utils/__init__.py                    0      0   100%
openomics/utils/df.py                         23     11    52%
openomics/utils/io.py                         40     19    52%
openomics/utils/read_gtf.py                  107     24    78%
openomics/visualization/__init__.py            1      0   100%
openomics/visualization/heatmat.py            11      8    27%
openomics/visualization/umap.py               29     24    17%
openomics_web/__init__.py                      0      0   100%
openomics_web/app.py                          69     69     0%
openomics_web/callbacks.py                     0      0   100%
openomics_web/layouts/__init__.py              0      0   100%
openomics_web/layouts/annotation_view.py       0      0   100%
openomics_web/layouts/app_layout.py            7      7     0%
openomics_web/layouts/clinical_view.py        10     10     0%
openomics_web/layouts/control_tabs.py          5      5     0%
openomics_web/layouts/datatable_view.py       28     28     0%
openomics_web/server.py                        2      2     0%
openomics_web/utils/__init__.py                0      0   100%
openomics_web/utils/io.py                     62     62     0%
openomics_web/utils/str_utils.py              25     25     0%
setup.py                                      44     44     0%
tests/__init__.py                              0      0   100%
tests/data/__init__.py                         0      0   100%
tests/data/test_dask_dataframes.py             0      0   100%
tests/test_annotations.py                     39      0   100%
tests/test_disease.py                         34      0   100%
tests/test_interaction.py                     20      1    95%
tests/test_multiomics.py                      46      1    98%
tests/test_sequences.py                       18      1    94%
--------------------------------------------------------------
TOTAL                                       2094   1078    49%

============================================== short test summary info ===============================================
ERROR tests/test_interaction.py::test_import_MiRTarBase - OSError: /data/datasets/Bioinformatics_ExternalData/miRTa...
================================ 34 passed, 32 warnings, 1 error in 662.47s (0:11:02) ================================

The main error seemed to be a missing dataset. On further inspection of the codebase, it seems that the package is supposed to download the miRTarBase data (although it appears to have the version 7 release URL hardcoded, when version 8 is now available). I wondered whether this was a permissions issue with not being able to create the /data/datasets/Bioinformatics_ExternalData/miRTarBase/ path on my system. I tried with sudo python -m pytest --cov=./ tests/test_interaction.py and got the same error. I tried sudo mkdir -p /data/datasets/Bioinformatics_ExternalData/miRTarBase beforehand, which returned:

mkdir: /data/datasets/Bioinformatics_ExternalData/miRTarBase: Read-only file system

This is likely due to macOS system integrity protection.

However, it seems I am also unable to resolve http://mirtarbase.mbc.nctu.edu.tw/cache/download/7.0/. I believe the URL should actually be http://mirtarbase.cuhk.edu.cn/cache/download/7.0/ (or even http://mirtarbase.cuhk.edu.cn/cache/download/8.0/)? I manually updated to the working version 7.0 release and updated the path in test_interaction.py before running the following:

mkdir -p tests/data/datasets/Bioinformatics_ExternalData/miRTarBase
sudo python -m pytest --cov=./ tests/test_interaction.py

I still received the error, so something needs looking at in more detail here.
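One way the fixture could avoid depending on an unwritable hardcoded path would be to fall back to a per-user cache directory. A hedged sketch — the function name and fallback location are my own suggestion, not the package's behavior:

```python
import os
import tempfile

def resolve_data_dir(preferred):
    """Return a usable dataset directory.

    Hypothetical helper: if the hardcoded path (e.g.
    /data/datasets/Bioinformatics_ExternalData/miRTarBase/) is absent or
    unwritable, as on macOS with System Integrity Protection, fall back to a
    cache directory under the system temp dir instead of raising OSError.
    """
    if os.path.isdir(preferred) and os.access(preferred, os.W_OK):
        return preferred
    fallback = os.path.join(tempfile.gettempdir(), "openomics_cache",
                            os.path.basename(os.path.normpath(preferred)))
    os.makedirs(fallback, exist_ok=True)
    return fallback
```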

When raising a PR, the build in Travis CI failed for all Python versions. Some work is needed to get these issues resolved, though I didn't inspect the output in detail.

#### For packages co-submitting to JOSS

- [ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

- [ ] A short summary describing the high-level functionality of the software
- [ ] Authors: A list of authors with their affiliations
- [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
- [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

A number of changes and bug fixes are required before I would recommend approving this package, but in general I feel it would be a great addition to pyOpenSci.

Estimated hours spent reviewing: 7


Review Comments

I would recommend looking through the Author's Guide in more detail, particularly the Tools for developers section. Using git pre-commit hooks for local development would be beneficial for both the author and any contributors, and would enable the production of higher-quality code. These can also be integrated with Travis CI to ensure any pull requests meet the same requirements via automated testing against a variety of different Python versions.

I echo the comments of @ksielemann, in that the package seems very interesting and I can see it would have a broad range of applications.

Hopefully, we can work together to get the issues resolved and get this approved. Would be interested in seeing this in JOSS at some point too.

gawbul commented 3 years ago

@JonnyTran

Just in case it gets lost in the review, I submitted PRs for a couple of the issues I hit here https://github.com/BioMeCIS-Lab/OpenOmics/pull/103 and here https://github.com/BioMeCIS-Lab/OpenOmics/pull/105.

Also opened an issue for the problem with the gzip.open returning an io.TextIOWrapper object here https://github.com/BioMeCIS-Lab/OpenOmics/issues/104.

NickleDave commented 3 years ago

Thank you @gawbul for the very thorough review, really appreciate that you could get that back to us a week before the deadline, especially with other things you've been dealing with.

@JonnyTran just want to check where you are at with this. I know it might feel like a lot, and you could have other things going on.

Our guidelines suggest aiming for a two-week turnaround time after reviews are in. We definitely don't have to hold strictly to that especially if you have other obligations to deal with.

But please when you can just give me some idea of how you'll move forward. I would suggest converting reviewer comments into issues on OpenOmics and linking to them where you can. See for example issues on physcraper from their review: https://github.com/McTavishLab/physcraper/issues

JonnyTran commented 3 years ago

Thanks so much for the thorough review @gawbul! 🥰

@NickleDave I’ve been trying to go over the issues. Actually it has been difficult because of the constant rolling electricity blackouts from the snowstorm currently in my state, Texas 🥶.

I will work on making the issues on GH and fix them soon!

NickleDave commented 3 years ago

!!! I'm sorry, didn't realize you were in Texas! The situation with the power grid is insane. Hope you can stay warm and safe!

Thank you for letting us know!

JonnyTran commented 3 years ago

Hi @gawbul, I've just gone over your reviews. Thanks for the care and attention on testing this software. There were many issues that I've missed and I've created Issues for most of your comments.

I had several issues with installation as described above, though I only tested on a macOS system running Big Sur version 11.2. It appears there are issues with Python 3.9.x in particular, though I can't confirm whether this is platform specific. There were some failures in building scipy on Python 3.8.7 due to dependent libraries missing on my machine.

I'm sorry you've had trouble running OpenOmics because of issues with installing the package dependencies from requirements.txt https://github.com/BioMeCIS-Lab/OpenOmics/issues/113, and importing umap when running the SomaticMutation vignettes https://github.com/BioMeCIS-Lab/OpenOmics/issues/114.

I have not seen these errors before, as I've never run tests on Mac OS X with Python 3.9. I've primarily been developing in an Anaconda Python 3.7 environment, which already comes with an llvm installation, so pip install openomics did not show any errors. Since this is still an issue on Mac OS X, I will set up more Travis CI automated tests against a variety of different Python versions and debug these problems. I've made an issue at https://github.com/BioMeCIS-Lab/OpenOmics/issues/117.

Docstrings seem to be available throughout the codebase for all relevant functions, however, I found the documentation on Read the Docs to be lacking in comparison, particularly in providing detail of function parameters that are otherwise available in the docstrings.

Currently, the documentation on Read the Docs has mostly been auto-generated by Sphinx. As pointed out by both @gawbul and @lwasser, I agree that the Read the Docs documentation could be better, particularly the usage guides and vignettes. Also, it is a great suggestion to use only Markdown or reStructuredText (rather than both). I've opened the issue https://github.com/BioMeCIS-Lab/OpenOmics/issues/119.

Running the example under Annotate LncRNAs with GENCODE genomic annotations returns the following: AttributeError: '_io.TextIOWrapper' object has no attribute 'startswith'

This issue has been fixed at https://github.com/BioMeCIS-Lab/OpenOmics/issues/104
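For context, the error above is a common Python pattern rather than anything gzip-specific: opening a gzipped file in text mode returns an `io.TextIOWrapper` (a file handle), and string methods like `.startswith` must be called on each line read from the handle, not on the handle itself. Here is a minimal hypothetical reconstruction of the bug class (the file name and contents are made up for illustration):

```python
import gzip
import os
import tempfile

# Write a small made-up gzipped annotation file to a temp directory.
path = os.path.join(tempfile.mkdtemp(), "annotation.gtf.gz")
with gzip.open(path, "wt") as f:
    f.write("## format: gtf\nchr1\tHAVANA\tgene\n")

with gzip.open(path, "rt") as f:
    # Wrong: f.startswith("##") -> AttributeError, because f is a
    # TextIOWrapper, not a string.
    # Right: iterate over the handle and test each line instead.
    header = [line for line in f if line.startswith("##")]

print(header)  # ['## format: gtf\n']
```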

No performance claims were provided. On my 16" MacBook Pro with 2.3 GHz Octa-core Intel Core i9 and 32GB RAM, however, I felt the package was a little slow in loading its dependencies on first run. Tests also took some time to complete.

Tests currently run against a large set of genome-wide RNAs, but probably only need a subset of the data. I will work on reducing the workload to make the tests faster.
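One way to do this (a standard-library sketch; the gene-ID format and subset size are arbitrary placeholders, not OpenOmics API) is to down-sample the test data with a fixed seed, so the subset stays deterministic across CI runs:

```python
import random

# Made-up genome-wide identifier list standing in for the full test data.
gene_ids = [f"ENSG{i:011d}" for i in range(20000)]

# A fixed seed keeps the subset identical on every run, so tests remain
# reproducible while exercising far less data.
rng = random.Random(42)
test_subset = rng.sample(gene_ids, k=100)

assert len(test_subset) == 100
assert set(test_subset) <= set(gene_ids)
```

The fixed `random.Random(42)` seed is the important design choice: a non-deterministic subset could make tests pass or fail intermittently between CI runs.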

There were many other issues listed at https://github.com/BioMeCIS-Lab/OpenOmics/issues. Is there a deadline to address all these? I think the issues with Read the Docs and automated tests targeting MacOS + Python3.9 can take a week to finish.

NickleDave commented 3 years ago

Hey @JonnyTran there's no strict deadline.

Our guide says "aim for one week" but with everyone already stressed out by the pandemic, and you facing even more problems because of the situation in Texas right now, I would not ask you to meet that.

Does three weeks sound do-able to you? If not we can definitely figure something else out. I just want to have something in my calendar to make sure I can keep track of reviews. Please let me know.

Your comment above seems like it covers most of the feedback from the reviews. But when you get a chance and things are a little more back to normal for you, please do also make sure you address any specific comments from @ksielemann too.

JonnyTran commented 3 years ago

Hi @NickleDave. Yes, 3 weeks is plenty of time for me to address the issues - how does Friday, March 12 sound?

NickleDave commented 3 years ago

Hi again @JonnyTran -- sorry for not replying sooner. I saw this and then got distracted

Yes March 12 is perfect if that still works for you. Will put it in my calendar.

JonnyTran commented 3 years ago

Hi guys @NickleDave @ksielemann @gawbul. It's been a while, but I finally managed to address most of the comments. Yay!


For @ksielemann comments:

Download of test data

I implemented a way to load the Expression data files directly from URL, see https://openomics.readthedocs.io/en/latest/usage/getting-started.html#creating-a-multi-omics-dataset

choose the directory in which the files should be downloaded?

There is now a function to do so, see https://openomics.readthedocs.io/en/latest/usage/annotate-external-databases.html#setting-the-cache-download-directory


For @gawbul comments:

Some tests are available and run as part of the Travis CI pipeline, though coverage isn't amazing and would benefit from additional work.

I've set up GitHub Actions at https://github.com/BioMeCIS-Lab/OpenOmics/actions/workflows/python-package.yml to replace Travis CI, for better pricing. The automated test suite currently targets macOS and Linux with Python 3.6–3.9. A few tests are failing (due to unavailability of some FTP servers), but I believe you shouldn't have the same problems on your macOS + Python 3.9 setup anymore.

I found the documentation on Read the Docs to be lacking

I did a complete revamp of ReadTheDocs documentation site, especially vignettes and usage guide at https://openomics.readthedocs.io/en/latest/usage/getting-started.html. The structure of API references is in place, although more in-depth usage guides should be written.

The readme.md file is also edited to reflect guidelines from https://www.writethedocs.org/guide/writing/beginners-guide-to-docs/#readme


For the JOSS submission, my manuscript can be compiled from https://github.com/BioMeCIS-Lab/OpenOmics/tree/master/inst

Is there anything else I might be missing?

NickleDave commented 3 years ago

Thank you @JonnyTran I can see you've put a ton of work in to addressing the reviewer comments.
And thank you for getting back to us by March 15th.

@ksielemann and @gawbul can you please let @JonnyTran know whether you feel the changes made, outlined in the comment above, are sufficient to address revisions you suggested in your reviews?

@JonnyTran I don't think there's anything else you're missing.
I will double-check and get back to you by Wednesday at the latest

ksielemann commented 3 years ago

First of all, thank you @JonnyTran for addressing the comments above! Some of my review comments are embedded in the specific points of the Package Review form. I believe that these points were not yet addressed (or did I miss something?).

NickleDave commented 3 years ago

Thank you for your quick reply @ksielemann Yes, I see that some of your comments were embedded in the form.

I don't mean to make more work for you, but could I ask you to raise separate issues on the OpenOmics repository for any comments that you feel have not yet been addressed?

This is the usual approach that JOSS reviews use (to raise issues with details on the repo, and then link to them / summarize on the "review issue"), I think to avoid situations like this. We should probably have very clear instructions suggesting the same approach in our guidelines--our fault.

@gawbul I think that you did raise issues for some of your comments. Can you please also check whether there are any that remain to be addressed?

ksielemann commented 3 years ago

I am sorry that I embedded the comments in a way that made them easy to overlook! I have now opened a few issues with my comments.

NickleDave commented 3 years ago

No need to apologize @ksielemann , definitely my fault for not being clearer about process. Thank you for taking time to open issues. That will help.

Sorry @JonnyTran for adding more to your plate. I just want to make sure it's very clear what review criteria have been met, according to reviewers.

If we need to, we can discuss further here, and you can link to specific issues.

I will check back by Friday at the latest. Again, as far as JOSS goes, if the manuscript compiles and you have a DOI for the version that we approve, then I think you are good to go. I will make sure of that when we reach that point.

ksielemann commented 3 years ago

I just closed my last issue with this comment: 'I think it is really important for future users that the usage guide works without errors. Otherwise, the user might get frustrated and refrain from using the library. So the usage guide should be updated according to the functionalities and current version of the package. But I think this can also happen while the package is further developed.'

So from my side, all my comments are sufficiently addressed now!

NickleDave commented 3 years ago

Great thank you @ksielemann glad to hear it.
Looks like the additional issues were all easily addressed or fixed already. Sorry, I didn't mean to ask you to do extra work, just wanted to make sure we were all on the same page about requested revisions. Thank you again 🙏

NickleDave commented 3 years ago

@gawbul just want to check back -- can you please confirm whether your comments have been addressed?

Looks like the corresponding issues were: https://github.com/BioMeCIS-Lab/OpenOmics/issues/119 https://github.com/BioMeCIS-Lab/OpenOmics/issues/115 https://github.com/BioMeCIS-Lab/OpenOmics/issues/114 https://github.com/BioMeCIS-Lab/OpenOmics/issues/113

gawbul commented 3 years ago

@NickleDave @JonnyTran I'll look at this asap. I've not been well this last while and am trying to recover. I won't forget 👍

NickleDave commented 3 years ago

thank you @gawbul for letting us know! sorry, somehow missed that you replied here

gawbul commented 3 years ago

No problem 😄 I'll get around to checking this one night this week 👍

gawbul commented 3 years ago

All looks good 👍

I made a quick comment here regarding an issue with the docs https://github.com/BioMeCIS-Lab/OpenOmics/issues/119#issuecomment-825137257, but otherwise, I'm happy everything has been addressed ☺️

NickleDave commented 3 years ago

Excellent, thank you so much @gawbul and @ksielemann for your very thorough reviews


🎉 openomics has been approved by pyOpenSci! Thank you @JonnyTran for submitting

There are a few things left to do to wrap up this submission:

Since this package is going to move on to JOSS for review, you'll also want to do the following:

All -- if you have any feedback for us about the review process please feel free to share it here. We are always looking to improve our process and our documentation in the contributing-guide. We have also been updating our documentation to improve the process so all feedback is appreciated!

lwasser commented 3 years ago

@NickleDave will you kindly fill out the very top of this submission with the reviewers, review version accepted etc - the very first comment? we want to ensure that we keep track of that for every review. once all is filled out and the review is complete (boxes checked above!) we can close the issue! thank you all!!

NickleDave commented 3 years ago

Thank you for taking care of those final to-dos @JonnyTran

Just checking, are you about to submit to JOSS? Based on your last couple of commits I'd guess yes. Please do let us know and/or reference this issue on the JOSS review when you do.

@lwasser I have edited the first comment to reflect those changes -- not sure if there's more I need to resolve what I was assigned through GitHub. I will close this once the JOSS review is initiated

@ksielemann @gawbul would you be okay with me adding you as contributors to the pyOpenSci site? I can request your review on the PR when I do so

JonnyTran commented 3 years ago

Hi @NickleDave,

Yes, I submitted to JOSS about 4 days ago, mentioning that this was approved by pyOpenSci. I will reference this issue once the review process starts on JOSS's GitHub.

Thanks for the updates!

NickleDave commented 3 years ago

Ah great -- I didn't realize how the process worked, I looked for an issue on their repo but didn't see it.

Thank you for letting me know! Please just let me know if you need anything from us for review at JOSS.

Going to go ahead and close. Yay!!! congrats on completing review and officially becoming part of pyOpenSci! Will tweet about openomics later! Thank you again @ksielemann and @gawbul for your great reviews

JonnyTran commented 3 years ago

Thanks so much for everyone's patience, help and support, @NickleDave, @lwasser, @gawbul, and @ksielemann !

Most of all thanks for making this a better software! I will work on making it more usable for all.

lwasser commented 3 years ago

oh all - So the JOSS review process is simple in that they accept our review as theirs! @arfon please note that this package was submitted to the JOSS review process. Can we please fast track it given we have reviewed here on the pyopensci side of things. Please let us know what you need. @JonnyTran is there an open issue in JOSS right now? can you kindly reference it here if you haven't already.

I normally keep these reviews open here until the JOSS part is finished. When it is they will ask you to add the JOSS badge on your readme as well. thank you all for this!

JonnyTran commented 3 years ago

Hi @lwasser , there isn't a JOSS issue yet, but I will tag this pyOpenSci issue once it is opened.

lwasser commented 3 years ago

ahh ok perfect. Normally they just accept through our issue! let's wait for arfon to get back to us here to ensure that is still the best process! congratulations on being a part of the pyopensci ecosystem and thank you for your submission here!

ksielemann commented 3 years ago

@ksielemann @gawbul would you be okay with me adding you as contributors to the pyOpenSci site? I can request your review on the PR when I do so

Sure, I am okay with this! Should I add myself to the file you mentioned above or do you prefer to add me to the contributor site yourself?

arfon commented 3 years ago

Sorry for the delay folks. Things are now moving in https://github.com/openjournals/joss-reviews/issues/3249