hartwigmedical / hmftools

Various algorithms for analysing genomics data
GNU General Public License v3.0
179 stars 56 forks source link

DEV-3502 Add LAMA tumor type and cohort to patient-db #460

Closed stefvanlieshout closed 9 months ago

stefvanlieshout commented 10 months ago

This one has been quite a journey, I have added and removed code many times as the existing setup was more complex than I anticipated. For details on what changes and why see the Jira ticket. Code has been tested in pilot using prod inputs (data loaded in hmfpatients_pilot). Commits are all named the same and are not very specific so ignore those and just look at the total result pls.

The changes add:

Considerations for the future:

Testing:

# Currently 7416 samples in both prod and pilot
SELECT COUNT(*) FROM hmfpatients.sample; # 7416
SELECT COUNT(*) FROM hmfpatients_pilot.sample; # 7416

# Example query to check differences in curation between prod and pilot
SELECT
    prod.sampleId AS sampleId,
    CONCAT_WS("|", prod.primaryTumorLocation, prod.primaryTumorSubLocation, prod.primaryTumorType, prod.primaryTumorSubType, prod.primaryTumorExtraDetails ) AS prod,
    CONCAT_WS("|", pilot.primaryTumorLocation, pilot.primaryTumorSubLocation, pilot.primaryTumorType, pilot.primaryTumorSubType, pilot.primaryTumorExtraDetails ) AS pilot
FROM
(SELECT sampleId, primaryTumorLocation, primaryTumorSubLocation, primaryTumorType, primaryTumorSubType, primaryTumorExtraDetails
FROM hmfpatients.sample LEFT JOIN hmfpatients.baseline ON hmfpatients.baseline.patientId = hmfpatients.sample.patientId) as prod
JOIN
(SELECT sampleId, primaryTumorLocation, primaryTumorSubLocation, primaryTumorType, primaryTumorSubType, primaryTumorExtraDetails
FROM hmfpatients_pilot.sample LEFT JOIN hmfpatients_pilot.baseline ON hmfpatients_pilot.baseline.patientId = hmfpatients_pilot.sample.patientId) as pilot
ON prod.sampleId = pilot.sampleId;