turbomam / biosample-xmldb-sqldb

Tools for loading NCBI Biosample into an XML database and then transforming that into a SQL database
MIT License

Include an alignment between GOLD Biosample IDs and NCBI Biosample IDs #27

Open · turbomam opened this issue 8 months ago

turbomam commented 8 months ago

Possibly available from https://gold.jgi.doe.gov/downloads, specifically the Public Studies/Biosamples/SPs/APs/Organisms Excel link, which downloads goldData.xlsx.

The Sequencing Project tab within goldData.xlsx looks the most promising.

But it might be beneficial to try other approaches, too.

turbomam commented 8 months ago

import json
import csv

# Load JSON data
with open('all-gold-studies.json', 'r') as f:
    data = json.load(f)

print("LOADED")

# Open TSV file for writing
with open('output.tsv', 'w', newline='') as tsvfile:
    # Define TSV writer with tab as the delimiter
    writer = csv.writer(tsvfile, delimiter='\t')

    # Write header row with reordered columns
    writer.writerow(['studyGoldId', 'biosampleGoldId', 'projectGoldId', 'ncbiBioSampleAccession', 'ncbiBioProjectAccession', 'sraExperimentIds'])

    # Iterate over studies
    for study in data:
        study_gold_id = study.get('studyGoldId', '')

        # Iterate over biosamples
        biosamples = study.get('biosamples', [])
        for biosample in biosamples:
            biosample_gold_id = biosample.get('biosampleGoldId', '')

            # Iterate over projects
            projects = biosample.get('projects', [])
            for project in projects:
                project_gold_id = project.get('projectGoldId', '')
                print(f"{study_gold_id}: {biosample_gold_id} {project_gold_id}")
                ncbi_bio_project_accession = project.get('ncbiBioProjectAccession', '')
                ncbi_bio_sample_accession = project.get('ncbiBioSampleAccession', '')

                # Extract sraExperimentIds and pipe-delimit them
                sra_experiment_ids = '|'.join(project.get('sraExperimentIds', []))

                # Write row to TSV with reordered columns and sraExperimentIds
                writer.writerow([study_gold_id, biosample_gold_id, project_gold_id, ncbi_bio_sample_accession, ncbi_bio_project_accession, sra_experiment_ids])

print("TSV file generated successfully.")
turbomam commented 8 months ago

Add code for generating all-gold-studies.json
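
A rough sketch of what that could look like, assuming the studies are pulled from the GOLD web API. The GOLD_API_BASE URL, the bearer-token auth, and the page/size parameters below are all placeholder assumptions and would need to be checked against the current GOLD API documentation; this is not a tested implementation.

import json

import requests

# All of these are placeholder assumptions about the GOLD API.
GOLD_API_BASE = "https://gold.jgi.doe.gov/rest"  # hypothetical base URL
ACCESS_TOKEN = "<SECRET>"
PAGE_SIZE = 100

session = requests.Session()
session.headers.update({"Authorization": f"Bearer {ACCESS_TOKEN}"})

all_studies = []
page = 0
while True:
    # Hypothetical paged endpoint returning studies with nested biosamples/projects
    response = session.get(
        f"{GOLD_API_BASE}/studies",
        params={"page": page, "size": PAGE_SIZE},
        timeout=60,
    )
    response.raise_for_status()
    batch = response.json()
    if not batch:
        break
    all_studies.extend(batch)
    page += 1

# Write the accumulated studies in the shape consumed by the TSV script above
with open("all-gold-studies.json", "w") as f:
    json.dump(all_studies, f)

print(f"Wrote {len(all_studies)} studies to all-gold-studies.json")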

turbomam commented 8 months ago

Via the JSON route there are roughly 188,000 mappings in total, about 186,000 of which map to an ncbiBioSampleAccession (counts determined in Excel).

turbomam commented 8 months ago

Columns from the Sequencing Project tab within goldData.xlsx:

  1. PROJECT GOLD ID
  2. PROJECT NAME
  3. SEQUENCING STRATEGY
  4. ITS PROPOSAL ID
  5. ITS SEQUENCING PROJECT ID
  6. ITS SAMPLE ID
  7. PMO PROJECT ID
  8. NCBI BIOPROJECT ACCESSION
  9. NCBI BIOSAMPLE ACCESSION
  10. SRA EXPERIMENT IDS
  11. PROJECT GENOME PUBLICATION PUBMED ID
  12. PROJECT OTHER PUBLICATION PUBMED ID
  13. PROJECT STATUS
  14. SEQUENCING STATUS
  15. SEQUENCING CENTERS
  16. PROJECT FUNDING
  17. PROJECT LEGACY GOLD ID
  18. STUDY GOLD ID
  19. ORGANISM GOLD ID
  20. BIOSAMPLE GOLD ID
  21. PROJECT CONTACTS
  22. PROJECT JGI DATA UTILIZATION STATUS
turbomam commented 8 months ago

import pandas as pd
from sqlalchemy import create_engine

# Input Excel file and sheet
xlsx_file = "goldData.xlsx"
sheet_name = "Sequencing Project"

# Define the output TSV file path
filtered_output_tsv_file = "sequencing_project_filtered.tsv"
intact_output_tsv_file = "sequencing_project.tsv"

# Define database connection parameters
DB_USER = 'postgres'
DB_PASSWORD = '<SECRET>'
DB_HOST = 'localhost'
DB_PORT = '15432'
DB_NAME = 'ncbi_biosamples_feb26'

# Define table name
table_name = 'gold_sequencing_project'

# Read the full sheet (pass nrows=1000 to pd.read_excel for a quick test run)
df = pd.read_excel(xlsx_file, sheet_name=sheet_name)

# df = df.head(100)

# Normalize column names: drop any newline and everything after it,
# lowercase, and replace spaces with underscores
df.columns = df.columns.str.split('\n').str[0].str.lower().str.replace(' ', '_')

df.to_csv(intact_output_tsv_file, sep='\t', index=False)

# Filter rows with non-blank values in both columns
df_filtered = df.dropna(subset=["ncbi_biosample_accession", "biosample_gold_id"])

# Select only the desired columns
df_filtered = df_filtered[["ncbi_biosample_accession", "biosample_gold_id"]]

# Remove duplicate rows
df_filtered = df_filtered.drop_duplicates()

# Save the DataFrame as TSV
df_filtered.to_csv(filtered_output_tsv_file, sep='\t', index=False)

print(f"The filtered data have been saved as '{filtered_output_tsv_file}'.")

# Create SQLAlchemy engine
engine = create_engine(f'postgresql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}:{DB_PORT}/{DB_NAME}')

# Insert the full (unfiltered) DataFrame into PostgreSQL
df.to_sql(table_name, engine, if_exists='replace', index=False)

print(f"The data have been inserted into the '{table_name}' table in the '{DB_NAME}' database.")

We should probably be inserting the data into Postgres chunk-wise; see the sketch below.
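
A minimal sketch of a chunk-wise load, reusing the df, engine, and table_name defined above; the chunksize of 5,000 and method='multi' are just illustrative choices, not benchmarked values.

# Chunk-wise insert: pandas issues the INSERTs in batches of `chunksize` rows
# instead of sending the whole DataFrame at once.
df.to_sql(
    table_name,
    engine,
    if_exists='replace',
    index=False,
    chunksize=5000,   # assumed batch size; tune to available memory
    method='multi',   # one multi-row INSERT statement per chunk
)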

We might have to grant permissions on this table to other users, since it was created by the postgres role.
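
Something along these lines could do the grant from the same script; biosample_ro is a placeholder for whatever role the other users actually connect as, and engine is the SQLAlchemy engine created above.

from sqlalchemy import text

READONLY_ROLE = "biosample_ro"  # placeholder role name

# Grant read access on the freshly created table
with engine.begin() as conn:
    conn.execute(text(f"GRANT SELECT ON gold_sequencing_project TO {READONLY_ROLE};"))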

turbomam commented 8 months ago

We can get 185,570 unique NCBI Biosample mappings from goldData.xlsx without having to run through the API.
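
A quick cross-check of that count against the filtered TSV written by the script above (sequencing_project_filtered.tsv); this is just a sanity check, not part of the pipeline.

import pandas as pd

# Count distinct (ncbi_biosample_accession, biosample_gold_id) pairs in the filtered TSV
pairs = pd.read_csv("sequencing_project_filtered.tsv", sep="\t")
n_unique = len(pairs.drop_duplicates(subset=["ncbi_biosample_accession", "biosample_gold_id"]))
print(f"{n_unique} unique GOLD-to-NCBI Biosample mappings")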

turbomam commented 8 months ago

select
    COUNT(*) as distinct_count
from
    (
    select
        distinct
        "ncbi_biosample_accession",
        "biosample_gold_id"
    from
        gold_sequencing_project
    where
        "ncbi_biosample_accession" is not null
        and "biosample_gold_id" is not null
) as distinct_rows;

185,569

turbomam commented 8 months ago

select
    "project_gold_id",
    COUNT(*) as num_duplicates
from
    gold_sequencing_project
group by
    "project_gold_id"
having
    COUNT(*) > 1;

Empty result.

Could add a unique or primary-key index on project_gold_id, but we are much more likely to search on ncbi_biosample_accession; a sketch of that index is below.
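
A minimal sketch of adding that index through the existing SQLAlchemy engine; the index name is arbitrary.

from sqlalchemy import text

# `engine` is the SQLAlchemy engine created earlier
with engine.begin() as conn:
    conn.execute(text(
        "CREATE INDEX IF NOT EXISTS idx_gold_sp_ncbi_biosample_accession "
        "ON gold_sequencing_project (ncbi_biosample_accession);"
    ))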

turbomam commented 8 months ago

Do we want to load the GOLD Biosample's metadata into Postgres? That probably would have to come from the API.