Generic Data File Parser

dwaring87 commented 3 weeks ago

Expected Behavior

Create a generic data file parser that can be used by all of the file upload functions, so that any upload supports .csv, .xls, .xlsx, and .txt (tab-separated) files. We've had a few people request CSV uploads due to problems/limitations with Excel.

I've started working on this on the topic/generic_file_parser branch.

There is a new CXGN::File::Parse class that can be used to parse any of the supported file types into a uniform parsed data format.

For example:

use CXGN::File::Parse;

my $parser = CXGN::File::Parse->new(
    file => '/home/production/public/data.csv',
    required_columns => [ 'accession_name', 'species_name' ],
    column_aliases => {
        'accession_name' => [ 'accession', 'name' ],
        'species_name' => [ 'species' ]
    },
    # values in these columns are split on the delimiter (',' by default)
    column_arrays => [ 'synonym', 'organization_name' ]
);
my $parsed = $parser->parse();

my $errors = $parsed->{errors};    # array of error messages
my $columns = $parsed->{columns};  # array of column headers
my $data = $parsed->{data};        # array of hashes, one per row
my $values = $parsed->{values};    # hash of unique values per column

will return:

errors: an array of error messages encountered during file read / parsing
- problems opening the file (file doesn't exist, error from the type-specific Perl module)
- missing required columns (when the required columns are specified)
- rows with no values for required columns
columns: an array of the column headers in the file
data: an array of hashes, where each array item is one row of the input file
values: a hash of the unique values for each column
- for columns specified in the column_arrays argument, the value will be split by the delimiter (',' by default) and returned as an array
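
As a rough sketch of how a caller might consume this structure: the snippet below builds a hand-written $parsed hash in the shape described above (it does not call CXGN::File::Parse, so it runs standalone), stops if any errors were reported, and otherwise walks the rows. The sample rows and the report_rows helper are hypothetical and only for illustration.

```perl
use strict;
use warnings;

# Hand-written stand-in for the hash returned by $parser->parse().
# The keys follow the description above; the sample values are made up.
my $parsed = {
    errors  => [],
    columns => [ 'accession_name', 'species_name' ],
    data    => [
        { _row => 2, accession_name => 'acc1', species_name => 'sp1' },
        { _row => 3, accession_name => 'acc2', species_name => undef },
    ],
    values  => {
        accession_name => [ 'acc1', 'acc2' ],
        species_name   => [ 'sp1' ],
    },
};

# Hypothetical helper: returns one summary line per row, or dies on parse errors.
sub report_rows {
    my ($parsed) = @_;

    my @errors = @{ $parsed->{errors} };
    die "Parse failed:\n" . join("\n", @errors) . "\n" if @errors;

    my @lines;
    foreach my $row ( @{ $parsed->{data} } ) {
        # _row is the row's line number in the input file; empty cells are undef
        my $species = defined $row->{species_name} ? $row->{species_name} : '(none)';
        push @lines, "row $row->{_row}: $row->{accession_name} ($species)";
    }
    return @lines;
}

my @lines = report_rows($parsed);
print "$_\n" for @lines;
```

Checking errors before touching data keeps the upload code from acting on a partially parsed file, and carrying _row through makes it possible to point users at the offending line in their original file.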

Example Input:

username	first_name	last_name	email address	organization
testing1	Test, Mȧ	Testing123	testing1@gmail.com	Cornell University


testing2	Test	Testing456	testing2@gmail.com	Cornell University
testing3	John	Testing	testing3@gmail.com	Cornell University

Example Output:

{
  "errors": [],
  "columns": [
    "username",
    "first_name",
    "last_name",
    "email address",
    "organization",
    "address",
    "country",
    "phone",
    "research_keywords",
    "research_interests",
    "webpage"
  ],
  "data": [
    {
      "address": null,
      "email address": "testing1@gmail.com",
      "first_name": "Test, Mȧ",
      "country": null,
      "organization": "Cornell University",
      "_row": 2,
      "research_keywords": null,
      "webpage": null,
      "phone": null,
      "research_interests": null,
      "last_name": "Testing123",
      "username": "testing1"
    },
    {
      "phone": null,
      "research_interests": null,
      "last_name": "Testing456",
      "username": "testing2",
      "_row": 5,
      "organization": "Cornell University",
      "webpage": null,
      "research_keywords": null,
      "country": null,
      "email address": "testing2@gmail.com",
      "address": null,
      "first_name": "Test"
    },
    {
      "first_name": "John",
      "email address": "testing3@gmail.com",
      "address": null,
      "country": null,
      "webpage": null,
      "research_keywords": null,
      "_row": 6,
      "organization": "Cornell University",
      "last_name": "Testing",
      "research_interests": null,
      "username": "testing3",
      "phone": null
    }
  ],
  "values": {
    "country": [],
    "username": [
      "testing2",
      "testing1",
      "testing3"
    ],
    "research_interests": [],
    "last_name": [
      "Testing123",
      "Testing456",
      "Testing"
    ],
    "phone": [],
    "research_keywords": [],
    "webpage": [],
    "first_name": [
      "Test, Mȧ",
      "Test",
      "John"
    ],
    "organization": [
      "Cornell University"
    ],
    "address": [],
    "email address": [
      "testing1@gmail.com",
      "testing3@gmail.com",
      "testing2@gmail.com"
    ]
  }
}
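
To make the relationship between the example input and output more concrete, here is a minimal sketch of a tab-separated reader that produces the same overall shape: it skips blank lines while still counting them (which is why _row jumps from 2 to 5 above), and it collects the unique non-empty values for each column. This is not the CXGN::File::Parse implementation. The parse_tsv_text function, its arguments, and the trimmed-down sample text are all hypothetical, and details such as header handling are assumptions.

```perl
use strict;
use warnings;

# Hypothetical sketch, NOT the CXGN::File::Parse implementation.
# $text: tab-separated content; $column_arrays: arrayref of columns to split on ','.
sub parse_tsv_text {
    my ($text, $column_arrays) = @_;
    my %is_array = map { $_ => 1 } @{ $column_arrays || [] };

    my @lines   = split /\r?\n/, $text;
    my @columns = split /\t/, shift @lines;    # first line holds the headers

    my (@data, %seen, %values);
    $values{$_} = [] for @columns;

    my $line_no = 1;                           # the header is line 1
    foreach my $line (@lines) {
        $line_no++;
        next if $line =~ /^\s*$/;              # blank lines are skipped but still counted
        my @cells = split /\t/, $line, -1;
        my %row = ( _row => $line_no );
        for my $i ( 0 .. $#columns ) {
            my $col  = $columns[$i];
            my $cell = $cells[$i];
            $cell = undef if defined $cell && $cell eq '';
            $row{$col} = $cell;
            next unless defined $cell;
            # Only columns named in column_arrays are split on the delimiter
            my @parts = $is_array{$col} ? split( /\s*,\s*/, $cell ) : ($cell);
            foreach my $part (@parts) {
                push @{ $values{$col} }, $part unless $seen{$col}{$part}++;
            }
        }
        push @data, \%row;
    }
    return { errors => [], columns => \@columns, data => \@data, values => \%values };
}

# Trimmed-down version of the example input, including the two blank lines
my $text = join "\n",
    "username\tfirst_name\torganization",
    "testing1\tTest, M\tCornell University",
    "",
    "",
    "testing2\tTest\tCornell University",
    "testing3\tJohn\tCornell University";

my $parsed = parse_tsv_text($text, []);
print join(',', map { $_->{_row} } @{ $parsed->{data} }), "\n";    # prints 2,5,6
```

Because first_name is not passed in the column-arrays list here, "Test, M" stays a single value, matching how "Test, Mȧ" appears unsplit in the example output.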

dwaring87 commented 3 weeks ago

Thoughts on whether it's worthwhile to convert some/most/all of the file uploads to use this? Any suggestions on the format of the parsed data?

dwaring87 commented 3 weeks ago

File uploads that could be updated:

[x] Locations
[x] Accessions
[ ] Pedigrees
[ ] Single Field Trials
[ ] Multi Field Trials
[ ] Phenotype Observations - Detailed
[ ] Phenotype Observations - Simple
[ ] Seedlots
[ ] Seedlot Transactions
[ ] Crosses
[ ] Genotyping Plates

... and many others

solgenomics / sgn

Generic Data File Parser #4975
