skiadas / PanthR

Statistics front-end and webserver with R connection

Need a "Read New File" dialog and smart detecting library #14

Open skiadas opened 11 years ago

skiadas commented 11 years ago

We should have a library that takes the contents of a file as a string, tries to guess the format, and returns the result as a dataset, possibly with notes on any columns it could not resolve. It should also try to detect date/time columns, for example.
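A minimal sketch of the column-typing part, in Python purely for illustration (the same logic would need porting to run client-side). It assumes comma-separated input with a header row; `DATE_FORMATS`, `guess_column_type`, `read_dataset`, and the shape of the returned dictionary are all hypothetical names chosen for this example, not an existing API.

```python
import csv
import io
from datetime import datetime

# Hypothetical short list of formats to try; a real library would need many more.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y-%m-%d %H:%M:%S")


def parse_date(text):
    """Return a datetime if text matches one of DATE_FORMATS, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            pass
    return None


def is_number(text):
    """True if text can be read as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def guess_column_type(values):
    """Classify a column as 'number', 'datetime', 'text', or 'mixed'."""
    present = [v for v in values if v.strip() != ""]  # ignore empty cells
    if not present:
        return "text"
    if all(is_number(v) for v in present):
        return "number"
    if all(parse_date(v) is not None for v in present):
        return "datetime"
    if any(is_number(v) or parse_date(v) is not None for v in present):
        return "mixed"  # some cells look typed, others do not
    return "text"


def read_dataset(text):
    """Parse comma-separated text with a header row into a dataset dict."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    columns = {
        name: [row[i] if i < len(row) else "" for row in body]
        for i, name in enumerate(header)
    }
    types = {name: guess_column_type(vals) for name, vals in columns.items()}
    # Notes flag the columns the guesser could not resolve to a single type.
    notes = [f"column '{name}' has mixed values"
             for name, kind in types.items() if kind == "mixed"]
    return {"columns": columns, "types": types, "notes": notes}
```

Cell values are left as strings here; the notes list is where unresolved columns would be reported back to the user.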

It should also be able to notice when the file is just one long stream of data, with each row of that data meant to be a separate variable.
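One possible reading of this case, sketched in Python for illustration only: each non-empty line holds the values of one variable, separated by commas or whitespace. The heuristic in `looks_like_row_per_variable` (few lines, each much longer than the number of lines) and the placeholder names `V1`, `V2`, ... are assumptions made for this example.

```python
import re


def split_values(line):
    """Split one line on commas and/or whitespace, dropping empty pieces."""
    return [v for v in re.split(r"[,\s]+", line.strip()) if v]


def looks_like_row_per_variable(text, ratio=5):
    """Assumed heuristic: every line has at least `ratio` times as many
    values as there are lines in the file."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    shortest = min(len(split_values(ln)) for ln in lines)
    return shortest >= ratio * len(lines)


def rows_to_variables(text):
    """Treat each non-empty line as one variable named V1, V2, ..."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return {f"V{i}": split_values(ln) for i, ln in enumerate(lines, start=1)}
```

Unlike a regular table, the lines do not need to have the same length, which is one more hint that could feed the detection.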

We should attach to this issue different files that we would want detected.

Ideally this should happen client-side, unless that is not possible for some file types.

altermattw commented 11 years ago
  1. CSV: comma-separated values might be a first step.
  2. tab-separated values. If you copy an HTML table and paste it into Notepad, it shows up as tab-separated data.
  3. Excel. The new Excel files (.xlsx) are zipped archives rather than conventional spreadsheet files. Hopefully someone has already cracked this problem and we can borrow their code.
  4. Google spreadsheets? I think Google has an API for this.
  5. SPSS.
  6. Clipboard. John Fox has this capability in R Commander. Presumably, it would involve the same parsing process as from other sources but would pull the data from memory.
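As a rough sketch of how a first pass might tell some of these formats apart from the raw bytes (Python, for illustration only): legacy binary Office files start with a fixed 8-byte signature, an .xlsx file is a zip archive that contains `xl/workbook.xml`, and delimited text can be guessed by counting tabs and commas on the first non-empty line. The function name `sniff_format` and its return labels are hypothetical, and SPSS and Google spreadsheets are not covered.

```python
import io
import zipfile

# Signature of legacy binary Office files such as old-style .xls.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def sniff_format(data):
    """Guess a coarse format label from raw bytes."""
    if data.startswith(OLE2_MAGIC):
        return "xls"
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            # An .xlsx workbook keeps its main part at xl/workbook.xml.
            if "xl/workbook.xml" in archive.namelist():
                return "xlsx"
        return "zip"
    text = data.decode("utf-8", errors="replace")
    first_line = next((ln for ln in text.splitlines() if ln.strip()), "")
    tabs, commas = first_line.count("\t"), first_line.count(",")
    if tabs == 0 and commas == 0:
        return "unknown"
    return "tsv" if tabs >= commas else "csv"
```

Clipboard data would go through the same text branch, since pasted tables typically arrive as tab-separated text.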
skiadas commented 11 years ago

We could also parse HTML tables directly out of an HTML page. The new Excel files are actually easier to read than the old ones, which were in a binary format that pretty much nothing but Excel could read. SPSS integration would be interesting, and key. It would potentially be nice to read SPSS viewer files as well, and convert those into graph/report/table objects.
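For the HTML case, a minimal sketch using Python's standard `html.parser` module (for illustration only; in a browser the DOM would do this work). `TableExtractor` and `extract_tables` are hypothetical names, and nested tables, `rowspan`, and `colspan` are deliberately ignored.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows, each a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed


def extract_tables(html):
    """Return all tables found in an HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables
```

Each extracted table could then be handed to the same column-typing step as any other delimited source.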

Charilaos Skiadas Department of Mathematics Hanover College

krantzj commented 11 years ago

This sounds like a good job for a student from CS, if we could get one.
A nice, self-contained project with a big impact.

John.
