polisquad / mpi-project

Implementation of the K-means algorithm with OpenMP and MPI
GNU General Public License v3.0

Implement reader for .data / .csv files #4

Open kjossul opened 5 years ago

kjossul commented 5 years ago

It would be nice to test against a real dataset instead of randomly generated points. Most real datasets come with .data or .csv extensions. A nice list can be found here. EDIT: a list of datasets for clustering is here!

sneppy commented 5 years ago

I don't know about the .data format, but a .csv file is just lines of comma-separated values:

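// Reads one value of type T per line; `format` is the scanf format string for T
Array<T> out;
FILE * fp = fopen(filename, "r");
if (fp)
{
    char buffer[256];
    T outvar;
    while (fgets(buffer, sizeof(buffer), fp))
        // Keep only lines that actually parse
        if (sscanf(buffer, format, &outvar) == 1)
            out.push(outvar);
    fclose(fp);
}

kjossul commented 5 years ago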

.data files are similar. From what I've seen, most datasets include a class among their attributes (see the iris dataset), so we can compute the clustering from the other attributes alone and use the class attribute to check the algorithm's accuracy and performance.
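
To make the idea concrete, here is a minimal sketch, assuming an iris-style layout where every column but the last is numeric and the last column is the class name. `Sample`, `parseLine` and `purity` are hypothetical names, not part of the project, and the sketch uses `std::vector` rather than the project's `Array<T>`. Purity is only one possible way to score a clustering against known classes; the thread does not prescribe a metric.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// One labelled sample: numeric features plus the class name from the last column.
// (Hypothetical type, introduced for illustration only.)
struct Sample
{
    std::vector<double> features;
    std::string label;
};

// Parse a line such as "5.1,3.5,1.4,0.2,Iris-setosa".
// Returns false for blank lines, lines with fewer than two columns,
// and lines where a feature column does not start with a number.
bool parseLine(const std::string & line, Sample & out)
{
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ','))
        fields.push_back(field);

    if (fields.size() < 2) return false;

    out.features.clear();
    for (std::size_t i = 0; i + 1 < fields.size(); ++i)
    {
        try { out.features.push_back(std::stod(fields[i])); }
        catch (const std::exception &) { return false; }
    }
    out.label = fields.back();
    return true;
}

// Purity: fraction of samples whose class matches the majority class
// of the cluster they were assigned to (1.0 means every cluster is pure).
double purity(const std::vector<int> & clusters, const std::vector<std::string> & labels)
{
    assert(clusters.size() == labels.size());
    if (clusters.empty()) return 0.0;

    // counts[cluster][label] = number of samples with that pair
    std::map<int, std::map<std::string, std::size_t>> counts;
    for (std::size_t i = 0; i < clusters.size(); ++i)
        ++counts[clusters[i]][labels[i]];

    std::size_t majoritySum = 0;
    for (const auto & cluster : counts)
    {
        std::size_t best = 0;
        for (const auto & entry : cluster.second)
            if (entry.second > best) best = entry.second;
        majoritySum += best;
    }
    return static_cast<double>(majoritySum) / clusters.size();
}
```

The features would be fed to K-means while the labels are kept aside; after clustering, the cluster index of each point and its label go into `purity`. The class column is never seen by the algorithm, so it stays a fair check.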