neuropoly / data-management

Repo that deals with datalad aspects for internal use

uk-biobank: invalid .tsv #46

Open kousu opened 3 years ago

kousu commented 3 years ago

participants.tsv is incorrectly formatted: it has empty cells where the BIDS standard requires the string "n/a", and some rows have fewer columns than the header:

nguenther@data:~/datasets/uk-biobank$ /usr/local/bin/bids-validator  .
bids-validator@1.6.1

    1: [ERR] All rows must have the same number of columns as there are headers. (code: 22 - TSV_EQUAL_ROWS)
        ./participants.tsv
            @ line: 2
            Evidence: row 1: sub-10000xx    X   99  99999       99999

    Please visit https://neurostars.org/search?q=TSV_EQUAL_ROWS for existing conversations about this issue.

    2: [ERR] Empty cell in TSV file detected: The proper way of labeling missing values is "n/a". (code: 23 - TSV_EMPTY_CELL)
        ./participants.tsv
            @ line: 2
            Evidence: row 1: sub-10000xx    X   99  99999       99999

    Please visit https://neurostars.org/search?q=TSV_EMPTY_CELL for existing conversations about this issue.

    1: [WARN] The Authors field of dataset_description.json should contain an array of fields - with one author per field. This was triggered based on the presence of only one author field. Please ignore if all contributors are already properly listed. (code: 102 - TOO_FEW_AUTHORS)

    Please visit https://neurostars.org/search?q=TOO_FEW_AUTHORS for existing conversations about this issue.

        Summary:                  Available Tasks:        Available Modalities: 
        1404 Files, 9.75GB                                T1w                   
        350 - Subjects                                    T2w                   
        1 - Session                                                             

    If you have any questions, please post on https://neurostars.org/tags/bids.
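
The two errors can also be located without the validator. Below is a minimal sketch in Python, assuming the header row defines the expected column count. `find_bad_rows` is a hypothetical helper name, and the sample rows are made up rather than taken from the dataset.

```python
def find_bad_rows(tsv_text):
    """Return (line_number, problem) pairs for rows that are short or have empty cells."""
    # splitlines() accepts both "\r\n" and "\n", so mixed line endings are fine here.
    lines = tsv_text.splitlines()
    if not lines:
        return []
    n_columns = len(lines[0].split("\t"))  # assumption: the header row is correct
    problems = []
    for line_number, line in enumerate(lines[1:], start=2):
        cells = line.split("\t")
        if len(cells) != n_columns:
            problems.append(
                (line_number, "expected %d columns, found %d" % (n_columns, len(cells)))
            )
        if "" in cells:
            problems.append((line_number, "empty cell"))
    return problems


# Made-up sample: line 3 has an empty cell, line 4 is missing its last column.
sample = "participant_id\tsex\tage\r\nsub-01\tM\t30\r\nsub-02\t\t41\nsub-03\tF"
for line_number, problem in find_bad_rows(sample):
    print("line %d: %s" % (line_number, problem))
```

For the real file, the text could be read with `open("participants.tsv", encoding="utf-8").read()`; the UTF-8 encoding is an assumption.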

kousu commented 3 years ago

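I can fix this with:

awk '
  BEGIN {
    OFS = FS = "\t"      # tab-separated in and out
    ORS = RS = "\r\n"    # DOS line endings in and out
  }
  {
    # 7 is the expected number of columns, counted by hand.
    for (i = 1; i <= 7; i++) {
      if ($i == "") {
        $i = "n/a"
      }
    }
    print
  }
' participants.tsv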

(Some subtleties: the file is TSV, so FS="\t"; it uses DOS line endings, so RS="\r\n"; and I manually counted the number of fields there should be instead of using NF, since NF is too small on rows that are missing trailing cells.)

kousu commented 3 years ago

Oh, actually this TSV file is a mix of DOS and Unix line endings, and the last line has no line ending at all. The BIDS spec (https://bids-specification.readthedocs.io/en/stable/02-common-principles.html#tabular-files) doesn't say anything about line endings, so I'm going to force them all to Unix format:

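tr -d "\r" < participants.tsv | awk '
  BEGIN {
    OFS = FS = "\t"
  }
  {
    # Fill empty or missing cells in the 7 expected columns.
    for (i = 1; i <= 7; i++) {
      if ($i == "") {
        $i = "n/a"
      }
    }
    print    # the default ORS="\n" also terminates the last line
  }
' > p
mv p participants.tsv
git add participants.tsv
git commit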
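
To confirm the line endings before and after the rewrite, something like the following Python sketch could be used. It works on raw bytes so that nothing gets translated on read; `count_line_endings` is a hypothetical helper name and the byte strings are made-up samples.

```python
def count_line_endings(data):
    """Count DOS, Unix and bare-CR line endings in raw file bytes."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        "lf": data.count(b"\n") - crlf,   # "\n" not preceded by "\r"
        "cr": data.count(b"\r") - crlf,   # "\r" not followed by "\n"
        "final_newline": data.endswith(b"\n"),
    }


# Made-up sample with one DOS ending, one Unix ending, and no final newline.
mixed = b"a\tb\r\nc\td\ne\tf"
print(count_line_endings(mixed))

# The same content after forcing Unix endings and terminating the last line.
fixed = b"a\tb\nc\td\ne\tf\n"
print(count_line_endings(fixed))
```

For the real file, the bytes could be read with `open("participants.tsv", "rb").read()`.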

This fix is on ng/bids-validate. Can you take a look at it, @alexfoias?

That's many thousands of lines to vet. Maybe you could write a Python script that loads both TSV files and compares the values, to make sure I didn't delete one by accident?
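
One way such a check could look, as a rough sketch rather than a finished tool: compare the two files cell by cell and accept only one kind of change, an empty or missing cell becoming "n/a". Line endings are deliberately ignored. The names `parse_tsv` and `compare_tsv` are hypothetical, the sample bytes are made up, and UTF-8 is an assumption.

```python
def parse_tsv(data):
    """Split raw TSV bytes into rows of cells, ignoring the line-ending style."""
    text = data.decode("utf-8").replace("\r", "")  # assumption: UTF-8 content
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a trailing newline does not start another row
    return [line.split("\t") for line in lines]


def compare_tsv(before, after):
    """List every difference other than an empty or missing cell becoming "n/a"."""
    old_rows, new_rows = parse_tsv(before), parse_tsv(after)
    differences = []
    if len(old_rows) != len(new_rows):
        differences.append(
            f"row count: {len(old_rows)} before, {len(new_rows)} after"
        )
    for line_number, (old, new) in enumerate(zip(old_rows, new_rows), start=1):
        # Short rows in the old file count as having empty trailing cells.
        padded = old + [""] * (len(new) - len(old))
        if len(padded) != len(new):
            differences.append(f"line {line_number}: columns were removed")
            continue
        for column, (old_cell, new_cell) in enumerate(zip(padded, new), start=1):
            if old_cell == new_cell:
                continue
            if old_cell == "" and new_cell == "n/a":
                continue  # the intended fix
            differences.append(
                f"line {line_number}, column {column}: {old_cell!r} -> {new_cell!r}"
            )
    return differences


# Made-up before/after samples in the shape described in this thread.
before = b"participant_id\tsex\tage\r\nsub-01\tM\t30\r\nsub-02\t\t41\nsub-03\tF"
after = b"participant_id\tsex\tage\nsub-01\tM\t30\nsub-02\tn/a\t41\nsub-03\tF\tn/a\n"
for difference in compare_tsv(before, after):
    print(difference)
```

An empty result means no value changed apart from the "n/a" fill. To run it on the real data, both versions of participants.tsv would need to be read as bytes, for example the committed file and a copy of the previous revision saved under another name.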