solgenomics / sgn

The code behind the Sol Genomics Network, Cassavabase and other Breedbase websites
https://solgenomics.net
MIT License

Generic Data File Parser #4975

Open dwaring87 opened 3 weeks ago

dwaring87 commented 3 weeks ago

Expected Behavior

Create a generic data file parser that can be used by all of the file upload functions and that supports .csv, .xls, .xlsx, and .txt (tab-separated) files for any upload. We've had a few people request CSV uploads because of problems and limitations with Excel.

I've started working on this on the topic/generic_file_parser branch.

There is a new CXGN::File::Parse class that can be used to parse any of the supported file types into a uniform parsed data format.

For example:

my $parser = CXGN::File::Parse->new(
    file => '/home/production/public/data.csv',
    required_columns => [ 'accession_name', 'species_name' ],
    column_aliases => {
        'accession_name' => [ 'accession', 'name' ],
        'species_name' => [ 'species' ]
    },
    column_arrays => [ 'synonym', 'organization_name' ]
);
my $parsed = $parser->parse();

my $errors = $parsed->{errors};    # list of parse errors (empty on success)
my $columns = $parsed->{columns};  # column headers, in file order
my $data = $parsed->{data};        # one hash per row, keyed by column, plus _row
my $values = $parsed->{values};    # unique values found in each column

The parse() method returns a hash with errors, columns, data, and values keys. For example:

Example Input:

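| username | first_name | last_name | email address | organization | address | country | phone | research_keywords | research_interests | webpage |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| testing1 | Test, Mȧ | Testing123 | testing1@gmail.com | Cornell University | | | | | | |
| testing2 | Test | Testing456 | testing2@gmail.com | Cornell University | | | | | | |
| testing3 | John | Testing | testing3@gmail.com | Cornell University | | | | | | |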

Example Output:

{
  "errors": [],
  "columns": [
    "username",
    "first_name",
    "last_name",
    "email address",
    "organization",
    "address",
    "country",
    "phone",
    "research_keywords",
    "research_interests",
    "webpage"
  ],
  "data": [
    {
      "address": null,
      "email address": "testing1@gmail.com",
      "first_name": "Test, Mȧ",
      "country": null,
      "organization": "Cornell University",
      "_row": 2,
      "research_keywords": null,
      "webpage": null,
      "phone": null,
      "research_interests": null,
      "last_name": "Testing123",
      "username": "testing1"
    },
    {
      "phone": null,
      "research_interests": null,
      "last_name": "Testing456",
      "username": "testing2",
      "_row": 5,
      "organization": "Cornell University",
      "webpage": null,
      "research_keywords": null,
      "country": null,
      "email address": "testing2@gmail.com",
      "address": null,
      "first_name": "Test"
    },
    {
      "first_name": "John",
      "email address": "testing3@gmail.com",
      "address": null,
      "country": null,
      "webpage": null,
      "research_keywords": null,
      "_row": 6,
      "organization": "Cornell University",
      "last_name": "Testing",
      "research_interests": null,
      "username": "testing3",
      "phone": null
    }
  ],
  "values": {
    "country": [],
    "username": [
      "testing2",
      "testing1",
      "testing3"
    ],
    "research_interests": [],
    "last_name": [
      "Testing123",
      "Testing456",
      "Testing"
    ],
    "phone": [],
    "research_keywords": [],
    "webpage": [],
    "first_name": [
      "Test, Mȧ",
      "Test",
      "John"
    ],
    "organization": [
      "Cornell University"
    ],
    "address": [],
    "email address": [
      "testing1@gmail.com",
      "testing3@gmail.com",
      "testing2@gmail.com"
    ]
  }
}
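
To make the shape of this output concrete, here is a minimal sketch of how tab-separated text could be turned into the same errors / columns / data / values structure using only core Perl. It is not the code on the topic/generic_file_parser branch: the parse_tsv_text function, its error message, and the blank-row handling are all assumptions made for illustration. In particular, skipping blank rows while still counting them is only one guess at how _row values with gaps could arise.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of CXGN::File::Parse: builds the
# errors/columns/data/values structure from tab-separated text.
sub parse_tsv_text {
    my ($text, %opts) = @_;
    my @required = @{ $opts{required_columns} || [] };
    my %parsed = ( errors => [], columns => [], data => [], values => {} );

    my @lines  = split /\r?\n/, $text;
    my $header = shift @lines;
    if ( !defined $header ) {
        push @{ $parsed{errors} }, 'File is empty';
        return \%parsed;
    }
    my @columns = split /\t/, $header;
    $parsed{columns} = \@columns;

    # Report any required column that is missing from the header
    my %have = map { $_ => 1 } @columns;
    for my $col (@required) {
        push @{ $parsed{errors} }, "Required column $col is missing"
            unless $have{$col};
    }

    my %seen;
    $parsed{values}{$_} = [] for @columns;
    my $row_number = 1;    # the header is row 1
    for my $line (@lines) {
        $row_number++;
        next if $line =~ /^\s*$/;    # assumption: blank rows are skipped but counted
        my @cells = split /\t/, $line, -1;    # -1 keeps trailing empty cells
        my %row = ( _row => $row_number );
        for my $i ( 0 .. $#columns ) {
            my $value = $cells[$i];
            $value = undef if defined $value && $value eq '';    # empty cell -> null
            $row{ $columns[$i] } = $value;
            next unless defined $value;
            # Keep each distinct value once per column, in first-seen order
            push @{ $parsed{values}{ $columns[$i] } }, $value
                unless $seen{ $columns[$i] }{$value}++;
        }
        push @{ $parsed{data} }, \%row;
    }
    return \%parsed;
}

# Usage: two data rows separated by a blank row
my $text = join "\n",
    join( "\t", 'username', 'first_name', 'organization', 'phone' ),
    join( "\t", 'testing1', 'Test', 'Cornell University', '' ),
    '',
    join( "\t", 'testing2', 'Test', 'Cornell University', '' );

my $parsed = parse_tsv_text( $text, required_columns => ['username'] );
printf "%d rows, last row is file row %d\n",
    scalar @{ $parsed->{data} }, $parsed->{data}[-1]{_row};
```

Unlike the example output above, where the values lists appear in no particular order, this sketch keeps values in first-seen order.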

dwaring87 commented 3 weeks ago

Thoughts on whether it's worthwhile to convert some/most/all of the file uploads to use this? Any suggestions on the format of the parsed data?

dwaring87 commented 3 weeks ago

File uploads that could be updated:

... and many others
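
As a rough illustration of what a converted upload might look like, the sketch below consumes a hash shaped like the parsed output shown earlier: it stops at parse errors, then uses each row's _row value to report problems against the original file. The validate_rows function and its message format are hypothetical, and the $parsed literal is a stand-in for the result of $parser->parse() so that the example runs on its own.

```perl
use strict;
use warnings;

# Stand-in for the hash returned by $parser->parse(),
# shaped like the example output in this issue.
my $parsed = {
    errors  => [],
    columns => [ 'username', 'first_name', 'email address' ],
    data    => [
        {
            _row            => 2,
            username        => 'testing1',
            first_name      => 'Test',
            'email address' => 'testing1@gmail.com',
        },
        {
            _row            => 5,
            username        => 'testing2',
            first_name      => undef,
            'email address' => 'testing2@gmail.com',
        },
    ],
    values => {
        username        => [ 'testing1', 'testing2' ],
        first_name      => ['Test'],
        'email address' => [ 'testing1@gmail.com', 'testing2@gmail.com' ],
    },
};

# Hypothetical validation step for an upload function: return the
# parser's own errors if there are any, otherwise check that every
# row has a value in each of the needed columns.
sub validate_rows {
    my ( $parsed, @needed ) = @_;
    my @errors = @{ $parsed->{errors} };
    return \@errors if @errors;    # do not inspect rows after a parse failure

    for my $row ( @{ $parsed->{data} } ) {
        for my $col (@needed) {
            my $value = $row->{$col};
            push @errors, "Row $row->{_row}: missing value for $col"
                unless defined $value && length $value;
        }
    }
    return \@errors;
}

my $errors = validate_rows( $parsed, 'username', 'first_name' );
print "$_\n" for @$errors;    # prints "Row 5: missing value for first_name"
```

Because every row carries its original file row number, each upload could report errors in terms the user can look up in their spreadsheet, regardless of which file type was uploaded.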