Closed jefffohl closed 8 years ago
@jefffohl In case you're not aware, the three header lines correspond to each other column by column, with the column types appearing in them, etc. So this could be checked automatically in any CSV file.
@rcrowder Yes, I was hoping I could come up with some algorithm for determining if lines 2 and 3 are header lines and not data. I just want to make sure that it doesn't return false positives. If you have ideas about how best to do this, let me know. Thanks!
@jefffohl I'll have a further look on the weekend. The first (header) line could be used to find the separator the user chose (comma, semicolon, tab, etc.), and then to check the 2nd, 3rd and 4th lines. I think the second/third lines only allow a restricted set of type keywords to be parsed in the OPF. I'm just getting into the Windows port of OPF, so I hope to find out the type tokens; all columns using string types could be a blocking issue.
@jefffohl And typically they get ignored: https://github.com/numenta/nupic/blob/master/tests/integration/nupic/opf/opf_checkpoint_test/opf_checkpoint_test.py#L246 From what I've seen so far, the second and third lines are just the separator.
@jefffohl May not need to look further? Field metadata seems to be defined here (could a Numenta engineer confirm this is correct?): https://github.com/numenta/nupic/blob/master/src/nupic/data/fieldmeta.py#L118
The 'specials' are defined further down in the same file.
@jefffohl Thanks for bringing over the open issues, and Richard for finding the definitions. Just a note: typically you'll have a short form such as float,,,,float (where the missing entries default to string, though I'm not sure).
Another problem we found is that some fields (anomalyScore at least) are meant to be float, but the OPF file gives them string type, because the first few values can be 'None'. This should change to use e.g. -1 instead. That is a NuPIC problem.
Because the OPF includes two extra header lines in the CSV output, the app is currently stripping these out automatically. However, we don't want to strip these out when the file is a non-OPF file. So, we either need to auto-detect if the file has more than one header line, or we need to let the user determine how many lines to strip out.
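A minimal sketch of the auto-detect option, along the lines suggested in this thread: guess the separator from the first line, then treat lines 2 and 3 as OPF header lines only if every cell is a recognized type or special token. The token sets below are assumptions and should be checked against fieldmeta.py; the function names and the list of candidate delimiters are hypothetical, and allowing empty type cells is based only on the unconfirmed short-form note above.

```python
import csv

# Assumed token sets; verify against nupic/data/fieldmeta.py before relying on them.
ASSUMED_TYPES = {"string", "datetime", "int", "float", "bool", "list", "sdr"}
ASSUMED_SPECIALS = {"", "R", "S", "T", "C", "L"}
CANDIDATE_DELIMITERS = [",", ";", "\t", "|"]  # hypothetical list of separators


def guess_delimiter(header_line):
    """Pick the candidate delimiter that occurs most often in the first line."""
    return max(CANDIDATE_DELIMITERS, key=header_line.count)


def parse_line(line, delimiter):
    """Split a single CSV line into cells; an empty line gives an empty list."""
    return next(csv.reader([line], delimiter=delimiter), [])


def count_header_lines(text):
    """Return 3 if lines 2 and 3 look like OPF type/special rows, else 1."""
    lines = text.splitlines()
    if len(lines) < 3:
        return 1
    delimiter = guess_delimiter(lines[0])
    names, types, specials = (parse_line(l, delimiter) for l in lines[:3])

    # A blank specials line parses as zero cells; pad it to the column count.
    specials += [""] * (len(names) - len(specials))
    if len(types) != len(names) or len(specials) != len(names):
        return 1

    # Empty type cells are allowed (short form), but at least one cell
    # must be a recognized type, so an all-empty data row does not match.
    types_ok = any(t.strip() for t in types) and all(
        t.strip() in ASSUMED_TYPES | {""} for t in types
    )
    specials_ok = all(s.strip() in ASSUMED_SPECIALS for s in specials)
    return 3 if types_ok and specials_ok else 1
```

Requiring both rows to match keeps false positives low, but a non-OPF file whose first two data rows happen to consist only of these tokens would still be misdetected, so a user-facing override for the number of lines to strip may still be worth having.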