Update mining procedure

nleroy917 commented 1 year ago

Overview

We need to update the function that mines a description from PEPs inside pephub for subsequent embedding and insertion into Qdrant. Currently, the pipeline looks inside each PEP and attempts to extract any project-level attributes. These are typically attributes that describe the data from a global perspective. See here for more info on the architecture. It would be better if we extracted all meaningful text from the PEP and used that to compute an embedding for vector search and retrieval. Example:

Bad:

>>> desc = mine_description(pep)
>>> desc
"Asthma study"

Good:

>>> desc = mine_description(pep)
>>> desc
"More than 200 asthma-associated genetic variants have been identified
    in genome-wide association studies (GWASs). Expression quantitative trait loci"

Design Goals

There are two main design goals for the updated pepembed description mining functionality:

It should extract large, information-rich descriptions from PEPs that contain a lot of metadata (GEO and bedbase PEPs).
It should behave exactly the same for big PEPs (GEO, SRA, bedbase) and small PEPs (normal user PEPs).

Basically, we want one function that operates for any PEP you give it, and it should be capable of accessing rich biological information at any level.

Technical details and code

Pseudo-code of the current implementation looks like this:

desc = ""
for attr in project_dict:
    # keep any top-level attribute whose name contains one of the keywords
    if any(key_word in attr for key_word in self.keywords):
        desc += project_dict[attr] + " "

We need this to be more flexible and more intelligent. For example, if we have a project yaml/dict that looks like:

name: GSE226825
pep_version: 2.1.0
sample_table: GSE226825_PEP_raw.csv
sample_modifiers:
  append:
    sample_data_processing: Adapter sequences were trimmed by Trimmomatic (v0.39).
      Trimmed reads aligned using HISAT2 (v2.2.0) with referring hg19 genome. Aligned
      reads are sorted by samtools (v1.9)...
    sample_extract_protocol_ch1: "RNA was extracted from ...
experiment_metadata:
  series_type: Expression profiling by high throughput sequencing
  series_title: RNA sequencing of peripheral blood mononuclear cells isolated from
    Korean patients with asthma
  series_status: Public on Mar 08 2023
  series_summary: More than 200 asthma-associated genetic variants have been identified
    in genome-wide association studies (GWASs). Expression quantitative trait loci
    (eQTL) ...

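We would need to extract attributes like sample_data_processing and series_summary, since these contain so much information about the data. Note that both are nested (under sample_modifiers.append and experiment_metadata), so a loop over top-level keys alone never reaches them.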
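One way to reach nested attributes like these is to walk the whole project dict recursively and keep every string value that looks like prose. The sketch below is only an illustration of that idea, not the current or planned pepembed implementation: the function names (iter_text_values, mine_text_from_dict) and the word-count threshold are assumptions made for the example.

```python
from typing import Any, Iterator


def iter_text_values(node: Any, min_words: int = 6) -> Iterator[str]:
    """Yield every string with at least `min_words` words found in a nested dict/list."""
    if isinstance(node, dict):
        for value in node.values():
            yield from iter_text_values(value, min_words)
    elif isinstance(node, (list, tuple)):
        for item in node:
            yield from iter_text_values(item, min_words)
    elif isinstance(node, str) and len(node.split()) >= min_words:
        # Collapse whitespace left over from multi-line YAML scalars.
        yield " ".join(node.split())


def mine_text_from_dict(project_dict: dict, min_words: int = 6) -> str:
    """Hypothetical helper: join all prose-like values into one description string."""
    return " ".join(iter_text_values(project_dict, min_words))


# A trimmed-down version of the project dict shown above.
project = {
    "name": "GSE226825",
    "pep_version": "2.1.0",
    "sample_modifiers": {
        "append": {
            "sample_data_processing": "Adapter sequences were trimmed by Trimmomatic (v0.39)."
        }
    },
    "experiment_metadata": {
        "series_type": "Expression profiling by high throughput sequencing",
        "series_status": "Public on Mar 08 2023",
    },
}

desc = mine_text_from_dict(project)
print(desc)
```

Because the walk does not depend on key names or nesting depth, the same call works for a small user PEP and a large GEO PEP. The word-count cutoff is a crude heuristic for skipping identifiers, versions, and dates; it could be combined with the existing keyword list rather than replace it.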

This is the exact spot in the code that is mining the description. This is where the magic is happening! Nearly all else in this repo can be thought of as a convenient "glue" that keeps the pipeline going and consistent. The mine_metadata_from_dict function is the only one that needs significant changes (at least for now...)

Secrets, debugging, and testing

There are three things that you will need for efficient development and testing.

The first is lab secrets. This package works with two databases, so there are a handful of secrets and passwords we use to connect to them. This repo is set up to be compatible with the lab secret workflow. If you are set up properly with that workflow, you can simply run source production.env and your environment will be populated with the correct credentials. Ask @nleroy917 or @nsheff if you need help here.

The second is debugging. I also have this repository set up to work with VSCode debugging. By hitting F5, you can launch the debugger, and you should then be able to use breakpoints to stop the code and inspect things.

The third is testing. I have a tests/ directory, but it doesn't contain anything 😅. Currently, the best way to test is to install the package locally (pip install .) and then run the CLI: pepembed. You can speed things up by limiting the number of results pulled from the database: pepembed -n 100.
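Until real tests exist, the two design goals can be written down as a small check that accepts any candidate mining function. This is a rough sketch under assumptions: check_design_goals, naive_miner, and the two example dicts are hypothetical and are not part of the pepembed code base.

```python
from typing import Any, Callable


def check_design_goals(mine: Callable[[dict], str]) -> None:
    """Assert the two design goals against any candidate mining function."""
    small_pep = {"name": "demo", "description": "Asthma study"}
    big_pep = {
        "name": "GSE226825",
        "experiment_metadata": {
            "series_summary": "More than 200 asthma-associated genetic variants have been identified"
        },
    }
    # Goal 2: the same call works for small and big PEPs and returns a string.
    for pep in (small_pep, big_pep):
        assert isinstance(mine(pep), str)
    # Goal 1: nested, information-rich text ends up in the description.
    assert "asthma-associated" in mine(big_pep)


def naive_miner(node: Any) -> str:
    """Stand-in used only to exercise the check: joins every string value, nested or not."""
    if isinstance(node, dict):
        return " ".join(naive_miner(value) for value in node.values())
    return node if isinstance(node, str) else ""


check_design_goals(naive_miner)
```

A function that only reads top-level keys fails the last assertion, which is the behavior this issue is asking to change. The same check could later be moved into tests/ and pointed at the real mining function.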

Extras

In addition, we discussed in the meeting that we should have multiple vectors for each object that is stored inside the Qdrant collection. Here is a blog post that explains how to do just that with Qdrant.
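As a rough illustration of what multiple vectors per object could look like, the sketch below builds the JSON bodies that, as far as I understand Qdrant's REST API, are used for named vectors: one body to create the collection and one to upsert a point. The vector names ("project" and "samples"), the toy vector size, and the make_point helper are assumptions for the example. Nothing is sent to a server here.

```python
import json

VECTOR_SIZE = 4  # toy size; the real value is fixed by the embedding model

# Body for creating a collection with one named vector per text source.
create_collection_body = {
    "vectors": {
        "project": {"size": VECTOR_SIZE, "distance": "Cosine"},
        "samples": {"size": VECTOR_SIZE, "distance": "Cosine"},
    }
}


def make_point(point_id: int, project_vec: list, samples_vec: list, payload: dict) -> dict:
    """Hypothetical helper: one point carrying two named vectors plus its payload."""
    return {
        "id": point_id,
        "vector": {"project": project_vec, "samples": samples_vec},
        "payload": payload,
    }


# Body for upserting points into that collection.
upsert_body = {
    "points": [
        make_point(1, [0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1], {"name": "GSE226825"})
    ]
}

print(json.dumps(upsert_body, indent=2))
```

With this layout, a search would name the vector it runs against, so project-level and sample-level descriptions can be queried independently for the same PEP.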

nsheff commented 1 year ago

donaldcampbelljr commented 1 year ago

This now mines common GSE attributes as well as high-level items such as project_name.

pepkit / pepembed