Closed seanjensengrey closed 1 week ago
That's a good point! I hadn't thought of that when I first wrote the script to mine programs from GitHub.
I think that adding a short hash to the end of the file name should be enough to avoid these collisions. Another option is adding the timestamp of when the program was mined, but that could make the names too long.
I won't have the time to work on it right now, but I'll try to get to it soon. If anyone's interested, the mining script would be a good starting point.
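For anyone picking this up, here is a rough sketch of the short-hash idea, not the mining script's actual code. The function name `add_hash_suffix`, the use of SHA-1, the 8-character length, and hashing the original case-sensitive path are all assumptions made for illustration.

```python
import hashlib
from pathlib import PurePosixPath


def add_hash_suffix(path: str, digest_len: int = 8) -> str:
    """Insert a short hash of the original path before the file extension.

    Hypothetical helper: the digest is computed from the exact
    (case-sensitive) path, so names that differ only in casing
    get different suffixes.
    """
    p = PurePosixPath(path)
    digest = hashlib.sha1(path.encode("utf-8")).hexdigest()[:digest_len]
    return str(p.with_name(f"{p.stem}-{digest}{p.suffix}"))


# Two names that would collide on a case-insensitive file system
print(add_hash_suffix("src/Main.py"))
print(add_hash_suffix("src/main.py"))
```

Because the suffix comes from the exact path, `Main.py` and `main.py` end up with different names even after case folding. A longer digest lowers the chance of an accidental clash at the cost of longer file names.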
There are some duplicate files with alternate casing that collide on case-insensitive file systems (the default on OSX).
It would be nice if OSX folks and Linux folks saw the same dataset.
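To see which files are affected, something like the following could list the colliding groups. This is only a sketch: `find_case_collisions` is a made-up name, and `str.casefold()` is an approximation of how a case-insensitive file system compares names (it ignores Unicode normalization, for example).

```python
from collections import defaultdict


def find_case_collisions(paths):
    """Group paths that become identical once casing is ignored."""
    groups = defaultdict(list)
    for path in paths:
        # casefold() approximates case-insensitive name comparison
        groups[path.casefold()].append(path)
    # Keep only groups with more than one distinct spelling
    return [sorted(group) for group in groups.values() if len(group) > 1]


collisions = find_case_collisions(["a/Foo.py", "a/foo.py", "b/bar.py"])
print(collisions)
```

The list of paths could come from `git ls-files` or a directory walk on a case-sensitive checkout, where both spellings still exist side by side.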