sanger / crawler

Store sample data from Lighthouse labs
MIT License
4 stars 3 forks source link

GPL-567 Parse files with the required minumum set of 6 columns (5 for MK) & ignore rest of data #62

Closed rl15 closed 4 years ago

rl15 commented 4 years ago

User story GPL-567 | As representative for lighthouse (Cristina A) I would like the samples in the plate map files from all 4 sites that contain additional column to the required set, to be parsed rather than rejected.

Who are the primary contacts for this story Cristina A Rich L

Acceptance criteria

The plate maps were expected to have a de facto standard header of Root Sample ID, (Viral Prep ID), RNA ID, (RNA-PCR ID), Result, Date Tested, Lab ID. Item in (brackets) were expected but are not used.

General tacit is to look only for only the required columns Root Sample ID, RNA ID, Result, Date Tested, Lab ID.

Know issue have been documented here: https://ssg-confluence.internal.sanger.ac.uk/x/OpoKBg

Known malformed file issues to date

  1. Milton Keynes Header does not contain Lab ID
  1. Cambridge-az Header contains extra columns and body contains some values for extra columns Root Sample ID, (Viral Prep ID), RNA ID, (RNA-PCR ID), Result, Date Tested, Lab ID, Ct
  1. Alderney Park

Since 23rd of July Alderney park have been including extra un-required columns and missing Lab ID Root Sample ID, Viral Prep ID, RNA ID, RNA-PCR ID, Result, Date Tested, Result Info, Sample Creation Date, Process Time

  1. Glasgow

No know issues

Additional context Option agreed in email Cristina A (Thursday, 9 July 2020 at 09:27) Subject: Re: Cambridge LH + new item, email to Rich L; Alan K; Sonia G; Emma G

emrojo commented 4 years ago

How is this being parsed now to understand is a MK sample?

Every center write files into a different folder so we can distinguish between each.

Lab ID is not set for Milton Keynes but is set in other labs.

JamesGlover commented 4 years ago

The actual issue with the current code isn't occurring upon parsing the original files, but on merging them together into a master file. If a centre adds columns to later files, these cause problems when attempting to merge with earlier files.

This bug might soon be rendered moot by other changes (Especially if we don't need the master file any more).

JamesGlover commented 4 years ago

Still working on a fix though, have tests written.

JamesGlover commented 4 years ago

When generating the master file extra columns are now ignored.

JamesGlover commented 4 years ago

Early MK samples had MK in LabID, later have the column unpopulated. Looking into how we're handling the missing header, as a quick glance at the validation suggests we should still be checking it is present.

JamesGlover commented 4 years ago

Oddly the latest MK file I found does have a LAB ID column

rl15 commented 4 years ago

Q. Any value in LH aggregate file yymmdd_hhmm_master.csv

Confirmed not being used by PAM - not used (Ariani, Wednesday, 29 July 2020 at 15:44) SSR - discussed in stand up on 30th agreed no longer required.

JamesGlover commented 4 years ago

Okay, so looks like the Mongo index will happily accept missing information for indexed fields. So going to restore the removal of Lab ID as a required field. What this means: Files containing Lab ID will be processed as normal, and lab ID will still be recorded in Mongo. Files missing Lab ID will be processed. They will be recorded in the samples table with Lab ID clearly missing. (In compass it displays as ‘No Field’) This meets the requirement to Parse with missing Lab ID and meets the general aim to be more permissive in what we accept, and to improve the visibility of missing information. This avoids any risks of auto-populating the field based on centre, which could result in us recording incorrect information if one of the centres starts sending info from multiple labs. It does mean that some reports using this information may not get immediate visibility on Lab ID, but they will have access to all the other information to begin chasing up any requirements.

JamesGlover commented 4 years ago

Hmm, does look like we have a few cases where MK files were parsed without a Lab ID (not necessarily a lab ID column) and then at a later date the same information has been parse With a lab ID. (Both versions ending up in the master file... which is odd as it implies the entries appeared in more than one file.) There are two entries in Mongo in these cases, one with a "MK" lab id, and one with "".

JamesGlover commented 4 years ago

Ah! Example file lacking Lab ID: MK_sanger_report_200722_0800.csv (And a few others) Picking an example from that file, and it is in Mongo, with a "" as LabID

JamesGlover commented 4 years ago

Hmm, so it's being loaded into the Master CSV. Which suggests that we're only validating field presence after building the master file, not before.

JamesGlover commented 4 years ago

Summary: 1) The main issues are not with file parsing, but with the creation of the master file. 2) When parsing an individual file we build up a dictionary of all columns, which means that all information in the CSV gets passed through to Mongo, known and unknown. 3) Validation checks for the presence of 'required fields' 4) However, when building a master file, the keys of this dictionary are based on the first file parsed (alphanumeric sort, so generally the one with the oldest date-stamp). If later files add new columns, it will error. Even if the added columns are known fields 5) On merging, any columns present in the first file, but missing in later files, will be left blank following the merge. 6) Finally, validation of column presence does not occur on the individual files, but rather only on the master file. Combined with step 5 this means that required columns can actually be missing in later files.

Changes: 1) Changed behaviour described in step 4. Additional columns will be silently dropped when building the master file. GPL-577 (https://github.com/sanger/crawler/issues/66) is likely to superceed this change. 2) Removed the Lab ID column from the required fields list. This will actually have pretty much no impact on existing behaviour (due to the behaviour described in points 5 and 6 above) but will mean we remain permissive on Lab ID presence.