jonchang / pump

GNU General Public License v2.0

Removing phlawd_db_maker dependency #2

Open jonchang opened 3 years ago

jonchang commented 3 years ago

We should aim to create an R script that does the work of https://github.com/jonchang/phlawd_db_maker, in the sense of creating a phlawd SQLite database. The current phlawd_db_maker software is insufficient for several reasons, most importantly that it requires downloading the entirety of the gbvrt database, when we really only need actinopts.

Suggested packages to research:

The SQLite database is structured as:

CREATE TABLE taxonomy (
    id INTEGER PRIMARY KEY,    -- opaque internal ID
    ncbi_id INTEGER,           -- NCBI taxonomy ID
    name VARCHAR(255),         -- NCBI name of taxon
    name_class VARCHAR(32),    -- e.g. scientific name
    node_rank VARCHAR(32),     -- e.g. class, order, family
    parent_ncbi_id INTEGER,    -- NCBI taxonomy ID of parent taxon
    edited_name VARCHAR(255),  -- name after cleaning (removing e.g. sp. or c.f.)
    left_value INTEGER,        -- nested sets representation left value
    right_value INTEGER        -- nested sets representation right value
);

CREATE TABLE sequence (
    id INTEGER PRIMARY KEY,     -- opaque internal identifier
    ncbi_id INTEGER,            -- NCBI ID
    locus VARCHAR(128),         -- NCBI Locus
    accession_id VARCHAR(128),  -- NCBI Accession
    version_id VARCHAR(128),    -- NCBI Accession + version
    description TEXT,           -- Corresponds to NCBI "Definition"
    title TEXT,                 -- Title of paper
    seq LONGTEXT                -- Nucleotide sequences
);

CREATE INDEX sequence_ncbi_id on sequence(ncbi_id);
CREATE INDEX sequence_accession_id on sequence(accession_id);
CREATE INDEX sequence_version_id on sequence(version_id);
CREATE TABLE information (id INTEGER PRIMARY KEY, name VARCHAR(128), value VARCHAR(128));
CREATE INDEX taxonomy_ncbi_id on taxonomy(ncbi_id);
CREATE INDEX taxonomy_parent_ncbi_id on taxonomy(parent_ncbi_id);
CREATE INDEX taxonomy_name on taxonomy(name);
CREATE INDEX taxonomy_right_value on taxonomy(right_value);
CREATE INDEX taxonomy_edited_name on taxonomy(edited_name);
CREATE INDEX taxonomy_left_value on taxonomy(left_value);
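
The indexes on left_value and right_value are what make the nested sets representation useful: every descendant of a taxon has a left_value strictly between that taxon's left_value and right_value, so a whole clade can be fetched with a single range query. Below is a minimal sketch of such a query. It uses Python's built-in sqlite3 module purely for illustration (the eventual tool is meant to be an R script), a trimmed-down version of the taxonomy table, and a hypothetical helper named `descendants`; none of this is taken from phlawd itself.

```python
import sqlite3

# In-memory database with only the columns needed for the illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE taxonomy (
        id INTEGER PRIMARY KEY,
        name VARCHAR(255),
        node_rank VARCHAR(32),
        left_value INTEGER,
        right_value INTEGER
    )
""")
conn.execute("CREATE INDEX taxonomy_left_value ON taxonomy(left_value)")
conn.execute("CREATE INDEX taxonomy_right_value ON taxonomy(right_value)")

# Small illustrative tree; ncbi_id and the other columns are omitted on purpose.
rows = [
    ("Labridae", "family", 1, 8),
    ("Acantholabrus", "genus", 2, 7),
    ("Acantholabrus palloni", "species", 3, 4),
    ("Acantholabrus alfaroi", "species", 5, 6),
]
conn.executemany(
    "INSERT INTO taxonomy (name, node_rank, left_value, right_value) "
    "VALUES (?, ?, ?, ?)",
    rows,
)


def descendants(conn, name, rank=None):
    """Return names of all taxa nested inside `name` (hypothetical helper)."""
    sql = """
        SELECT child.name
        FROM taxonomy AS parent
        JOIN taxonomy AS child
          ON child.left_value > parent.left_value
         AND child.left_value < parent.right_value
        WHERE parent.name = ?
    """
    params = [name]
    if rank is not None:
        sql += " AND child.node_rank = ?"
        params.append(rank)
    sql += " ORDER BY child.left_value"
    return [row[0] for row in conn.execute(sql, params)]


print(descendants(conn, "Labridae"))
print(descendants(conn, "Labridae", rank="species"))
```

No recursive walk over parent_ncbi_id is needed for this kind of query, which is presumably why the schema stores both representations.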

We could have a series of scripts in the R/ directory to

jonchang commented 3 years ago

Genes we want: https://fishtreeoflife.org/methods/#tree-searching

Some gene name synonyms

```
4c4        tmo-4c4
4c4        tmo4c4
coi        co1
coi        cox1
coi        coxi
cytb       cyb
cytb       cyt-b
cytb       cytochrome b
ficd       fic domain
glyt       AGO61
glyt       gtdc2
myh6       myh
nd2        mt-nd2
nd2        mtnd2
nd2        nadh2
nd2        nad2
nd2        nadh dehydrogenase subunit 2
nd2        nadh dehydrogenase 2
nd2        nadh subunit 2
nd2        nd2
nd4        mt-nd4
nd4        mtnd4
nd4        nadh4
nd4        nad4
nd4        nadh dehydrogenase subunit 4
nd4        nadh dehydrogenase 4
nd4        nadh subunit 4
nd4        nd4
plagl2     plag
ptr        ptchd1
rhodopsin  rh
rhodopsin  rho
ripk4      pkk
sh3px3     sh3
sh3px3     snx30
sh3px3     snx33
sreb2      gpr85
sreb2      sreb
tbr1       tbr
zic1       zic
```

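One way the synonym list could be used is as a lookup that maps whatever gene label appears on a sequence record to the canonical name. The sketch below assumes the first name on each synonym line is the canonical one and includes only a subset of the pairs. It is written in Python for illustration only (the planned tool is an R script), and `build_lookup` and `canonical_gene` are hypothetical names.

```python
# (canonical, synonym) pairs; a subset of the list above, for illustration.
SYNONYM_PAIRS = [
    ("coi", "co1"),
    ("coi", "cox1"),
    ("coi", "coxi"),
    ("cytb", "cyb"),
    ("cytb", "cyt-b"),
    ("cytb", "cytochrome b"),
    ("glyt", "AGO61"),
    ("glyt", "gtdc2"),
    ("nd2", "nadh dehydrogenase subunit 2"),
    ("rhodopsin", "rh"),
    ("rhodopsin", "rho"),
]


def build_lookup(pairs):
    """Map every lowercased synonym (and canonical name) to its canonical name."""
    lookup = {}
    for canonical, synonym in pairs:
        lookup[canonical.lower()] = canonical
        lookup[synonym.lower()] = canonical
    return lookup


def canonical_gene(name, lookup):
    """Return the canonical gene name, or None if the label is not recognized."""
    # Lowercase and collapse runs of whitespace before matching.
    key = " ".join(name.lower().split())
    return lookup.get(key)


LOOKUP = build_lookup(SYNONYM_PAIRS)
print(canonical_gene("COX1", LOOKUP))
print(canonical_gene("  Cytochrome   B ", LOOKUP))
print(canonical_gene("16s", LOOKUP))
```

The sketch matches whole labels only. Substring matching would be risky with short synonyms such as `rh` or `sh3`, which can occur inside unrelated gene names.
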
jonchang commented 2 years ago

Create a SQLite table for taxonomy information from fishtree

Name                   Rank     Left  Right
Labridae               family   1     8
Acantholabrus          genus    2     7
Acantholabrus palloni  species  3     4
Acantholabrus alfaroi  species  5     6
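
The left and right values in the table above can be generated by a depth-first walk that hands out one counter value when it enters a taxon and another when it leaves. The following is a minimal sketch of that numbering, written in Python for illustration only (the planned tool is an R script). The function `assign_nested_sets`, the table name `fishtree_taxonomy`, and the hard-coded parent-to-children mapping are all assumptions; how the hierarchy would actually be obtained from fishtree is not shown.

```python
import sqlite3


def assign_nested_sets(children, root):
    """Return {taxon: (left, right)} using depth-first numbering from 1."""
    bounds = {}
    counter = 1
    # Explicit stack instead of recursion, so deep taxonomies are not a problem.
    stack = [(root, False)]
    while stack:
        node, leaving = stack.pop()
        if leaving:
            # Second visit: all descendants are numbered, so close the interval.
            bounds[node] = (bounds[node][0], counter)
        else:
            # First visit: record the left value and schedule the closing visit.
            bounds[node] = (counter, None)
            stack.append((node, True))
            for child in reversed(children.get(node, [])):
                stack.append((child, False))
        counter += 1
    return bounds


# Illustrative hierarchy matching the table above.
children = {
    "Labridae": ["Acantholabrus"],
    "Acantholabrus": ["Acantholabrus palloni", "Acantholabrus alfaroi"],
}
ranks = {
    "Labridae": "family",
    "Acantholabrus": "genus",
    "Acantholabrus palloni": "species",
    "Acantholabrus alfaroi": "species",
}

bounds = assign_nested_sets(children, "Labridae")

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fishtree_taxonomy (
        name TEXT PRIMARY KEY,
        node_rank TEXT,
        left_value INTEGER,
        right_value INTEGER
    )
""")
conn.executemany(
    "INSERT INTO fishtree_taxonomy VALUES (?, ?, ?, ?)",
    [(name, ranks[name], left, right) for name, (left, right) in bounds.items()],
)
conn.commit()

for row in conn.execute("SELECT * FROM fishtree_taxonomy ORDER BY left_value"):
    print(row)
```

The column names follow the phlawd schema (left_value, right_value) rather than Left and Right, since LEFT and RIGHT are SQL keywords and would need quoting.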