usnistgov / nestor-tmp2

Quantifying tacit knowledge for investigatory analysis

index on the h5 files #19

Closed saschaMoccozet closed 6 years ago

saschaMoccozet commented 6 years ago

When opening the h5 files with the key "df", the dataframe contains an extra column, "Unnamed: 0", which represents the index of the MWO.

rtbs-dev commented 6 years ago

This seems like a strange issue caused by the way pandas distinguishes "data columns" in the HDFStore. Potentially fixable by passing data_columns=df.columns.tolist()?

I can try it quickly.

rtbs-dev commented 6 years ago

On another note, the documentation is not great for this yet, but the 'tags' and 'rels' keys in the HDFStore are not affected by this issue.

rtbs-dev commented 6 years ago

Got it. It wasn't a problem with the HDFStore; it was an issue with the way the mainwindow was reading in the selected CSV file. self.dataframe_Original needs to be read in with index_col=0 (mainwindow.py, lines 134 and 138).

Alternatively, we could just assume no incoming CSV data has an index column. I'll rework the anonymization pipeline we've been using, if so.
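
For illustration, a minimal sketch of how an "Unnamed: 0" column appears and how index_col=0 removes it. This is not the code from mainwindow.py; the column names and values are made up, and it assumes the input CSV was written by pandas with its default index.

```python
import io

import pandas as pd

# Hypothetical MWO data; the real columns are not shown in this thread.
df = pd.DataFrame(
    {"description": ["replace belt", "oil leak"], "asset": ["A1", "B2"]}
)

# to_csv writes the index by default, as a first column with an empty header.
csv_text = df.to_csv()

# Default read: the empty header is turned into a column named "Unnamed: 0".
naive = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column is used as the index instead.
fixed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(naive.columns))
print(list(fixed.columns))
```

Any dataframe stored from the naive read would carry the extra column along, which matches what is seen under the "df" key.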

saschaMoccozet commented 6 years ago

But if the original CSV file uploaded by the user did not have an index column, it will not work.
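
A small sketch of that failure case, using a made-up CSV that has no index column: with index_col=0, pandas takes the first real data column as the index, so it disappears from the columns.

```python
import io

import pandas as pd

# Hypothetical user-supplied CSV without any index column.
csv_text = "description,asset\nreplace belt,A1\noil leak,B2\n"

# Forcing index_col=0 consumes the first real column as the index.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df.columns))  # "description" is no longer among the columns
print(df.index.name)     # it has become the index name instead
```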

rtbs-dev commented 6 years ago

Right (see the edit to my comment). So I guess we should just assume every column supplied is "useful", and try not to write excess index columns where possible (the MWO_anon file I've been working with has this issue, for example).
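
One way to avoid writing the excess index column in the first place is to pass index=False when exporting. A hedged sketch with made-up data follows; how MWO_anon was actually produced is not shown in this thread.

```python
import io

import pandas as pd

# Hypothetical anonymized data standing in for a file like MWO_anon.
df = pd.DataFrame(
    {"description": ["replace belt", "oil leak"], "asset": ["A1", "B2"]}
)

# index=False leaves the row index out of the CSV entirely.
clean_csv = df.to_csv(index=False)

# A plain read now returns exactly the original columns.
roundtrip = pd.read_csv(io.StringIO(clean_csv))

print(list(roundtrip.columns))
```

With files written this way, the reader can treat every incoming column as data and never needs index_col=0.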

rtbs-dev commented 6 years ago

IOW this wasn't a problem with the app, it was a problem with the input file. Closing, at least for now, until perhaps another way to deal with column input is worked out.

saschaMoccozet commented 6 years ago

OK, no problem. I'm still wondering whether the relation between the index of the MWO and the index on the tag binary might create a problem if it links this information using the excess index columns ...