Scrub db of non-UTF-8 characters

Meta-analyses will become increasingly common, but there are studies with non UTF-8 characters that can cause silent failures when creating the analysis artifact. We need to scrub these from the db

Example: https://qiita.ucsd.edu/analysis/description/15286/ the error is that one of the studies selected can't be parsed cause of UTF-8 errors:

ValueError: There are invalid (non UTF-8) characters in your information file. The offending fields and their location (row, column) are listed below, invalid characters are represented using 🐾: "Universita🐾🐾 degli Studi di Napoli Federico II" = (1, 16), (2, 16), (3, 16), (4, 16), (5, 16), (6, 16), (7, 16), (8, 16), (9, 16), (10, 16), (11, 16), (12, 16), (13, 16), (14, 16), (15, 16), (16, 16), (17, 16), (18, 16), (19, 16), (20, 16), (21, 16), (22, 16), (23, 16), (24, 16), (25, 16), (26, 16), (27, 16), (29, 16), (30, 16), (31, 16), (32, 16), (33, 16), (34, 16), (35, 16), (36, 16), (37, 16), (38, 16), (39, 16), (40, 16), (41, 16), (42, 16), (43, 16), (44, 16), (45, 16), (46, 16), (47, 16), (48, 16), (49, 16), (50, 16), (51, 16), (52, 16), (53, 16), (54, 16), (55, 16), (56, 16), (57, 16), (58, 16), (59, 16), (60, 16), (61, 16), (62, 16), (63, 16), (64, 16), (65, 16), (66, 16), (67, 16), (68, 16), (69, 16), (70, 16), (71, 16), (72, 16), (73, 16), (74, 16), (75, 16), (76, 16), (77, 16), (78, 16), (79, 16), (80, 16), (81, 16), (82, 16), (83, 16), (84, 16), (85, 16), (86, 16), (87, 16), (88, 16), (89, 16), (90, 16), (91, 16), (92, 16), (93, 16), (94, 16), (95, 16), (96, 16)

This has been solved during the last release (April 9th). Note that adding this problematic characters is not allowed via Qiita but we had some older studies that had this issue. Thus to scrub them we needed to find them first. We used this code:

from qiita_db.study import Study

studies = list(Study.get_by_status('private').union(
    Study.get_by_status('public')).union(Study.get_by_status('sandbox')))

to_review = []
for study in studies:
    if study.sample_template is None:
        continue
    to_fix = set(
        [(cat, v) for cat in study.sample_template.categories() 
         for v in study.sample_template.get_category(cat).values() 
         if v is not None
         for c in v if ord(c) < 32 or ord(c) > 126])
    to_fix_prep = []
    for pt in study.prep_templates():
        to_fix_pt = set(
            [(cat, v) for cat in pt.categories() 
             for v in pt.get_category(cat).values() 
             if v is not None
             for c in v if ord(c) < 32 or ord(c) > 126])
        if to_fix_pt:
            to_fix_prep.append((pt, pt.id, to_fix_pt))
    if to_fix or to_fix_prep:
        to_review.append((study, study.id, to_fix, to_fix_prep))

for study, sid, cst, cpt in to_review:
    for col, val in cst:
       print  ("qiita.sample_%d" % sid, col, val)
    for pt, pid, to_fix_pt in cpt:
        for col, val in to_fix_pt:
            print  ("qiita.prep_%d" % pid, col, val)

Once we had that, we can solve via a combination of SQL and python to regenerate those sample and prep info files. SQL:

UPDATE qiita.prep_19 SET library_construction_protocol = 'EMP V4 515f,806rbc protocol';
UPDATE qiita.prep_31 SET sample_center = 'Universita degli Studi di Napoli Federico II';
UPDATE qiita.prep_3156 SET instrument_model = 'Illumina MiSeq';
ALTER TABLE qiita.sample_1883 DROP COLUMN samp_collection_method;
UPDATE qiita.sample_10764 SET cheese_name = 'Bonde Du Poitou Herve Mons' WHERE cheese_name = 'Bonde Du Poitou Herv̩ Mons';
UPDATE qiita.sample_10868 SET cheese_name = 'Bonde Du Poitou Herve Mons' WHERE cheese_name = 'Bonde Du Poitou Herv̩ Mons';
UPDATE qiita.sample_1774 SET diet = 'B fishing, hunting, cassava-yucuta, some maize and plantane, local fruits' WHERE diet = 'B fishing, hunting, cassava-yucuta, some maize and plantane, local í_fruits';
UPDATE qiita.sample_10331 SET state = 'Stockholm County' WHERE state = 'Stockholms län';
UPDATE qiita.sample_10288 SET state = 'Stockholm County' WHERE state = 'Stockholms län';
UPDATE qiita.sample_10405 SET state = 'Stockholm County' WHERE state = 'Stockholms län';
UPDATE qiita.sample_10210 SET mk_navn = 'MA' WHERE mk_navn = 'MÃ¸';
UPDATE qiita.sample_10181 SET mk_navn = 'Mo' WHERE mk_navn = 'Mø';
UPDATE qiita.sample_10181 SET anonymized = 'Mo' WHERE anonymized = 'Mø';
UPDATE qiita.sample_10203 SET mk_navn = 'Mo' WHERE mk_navn = 'Mø';
UPDATE qiita.sample_10203 SET anonymized = 'Mo' WHERE anonymized = 'Mø';
UPDATE qiita.sample_10209 SET mk_navn = 'Mo' WHERE mk_navn = 'Mø';
UPDATE qiita.sample_10201 SET mk_navn = 'Mo' WHERE mk_navn = 'Mø';
UPDATE qiita.sample_10201 SET anonymized = 'Mo' WHERE anonymized = 'Mø';
UPDATE qiita.sample_1740 SET physical_specimen_location = 'Armacao baixa' WHERE physical_specimen_location = 'Armaç o baixa';

python:

from qiita_db.metadata_template.prep_template import PrepTemplate
from qiita_db.metadata_template.sample_template import SampleTemplate

preps = [19, 31, 3156]
samples = [1883, 10764, 10868, 1774, 10331, 10288, 10405, 10210, 10181, 10203, 10209, 10201, 1740]

for pid in preps:
    PrepTemplate(pid).generate_files()  

for sid in samples:
    SampleTemplate(sid).generate_files()

To verify that this was solved, we reran the initial code and nothing new was found.

qiita-spots / qiita

Scrub db of non-UTF-8 characters #2476