Midterm feedback - GitHub Issues

miguelgfierro commented 6 months ago

[x] Remove config files like https://github.com/riwanahas/runreco/blob/first_iteration/.DS_Store
[x] The code should be added via PRs https://github.com/riwanahas/runreco/pulls, not directly to the main branch
[ ] Create a GitHub Action that runs the tests every time there is a PR, like https://github.com/miguelgfierro/project_template/actions
[x] There should be a main branch where all the code should go https://github.com/riwanahas/runreco/branches
[x] The functions should be in the libraries and then imported to notebooks: https://github.com/riwanahas/runreco/blob/first_iteration/spotify_reco/models/spotify_reco_model_1.ipynb See this in detail: https://github.com/miguelgfierro/project_template
[x] In the notebooks, all the imports should be in the first cell https://github.com/riwanahas/runreco/blob/first_iteration/spotify_reco/models/spotify_reco_model_1.ipynb
[x] The dataset should not be in the repo; move it to another repo or another host
[ ] For binary classification, AUC is a good metric to use
[x] Try to get real data from a dataset. Example: https://github.com/aswintechguy/Machine-Learning-Projects/blob/master/Million%20Songs%20Dataset%20-%20Recommendation%20Engine/Million%20Songs%20Data%20-%20Recommendation%20Engine.ipynb
[ ] Create a release for every delivery point https://github.com/miguelgfierro/project_template/releases/tag/0.1.1
[x] Remove data from the repo https://github.com/riwanahas/runreco/tree/first_iteration/spotify_reco/datasets and use external storage or databases instead
[ ] Remove tokens from the repo and rotate the token https://github.com/riwanahas/runreco/blob/model-2/spotify_reco/models/access_token.txt
[ ] I don't understand the need to add the million songs code to your repo https://github.com/riwanahas/runreco/commit/37e0ffb1a867c3a19ff8c650756b7dec7a80a9ac. This is Python 2 code.
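
For the AUC item in the checklist, here is a minimal sketch of how the metric can be computed for binary labels. It uses the rank-sum (Mann-Whitney U) identity in plain Python; the function name `roc_auc` and the sample labels and scores are illustrative assumptions, not code from the repo. In practice `sklearn.metrics.roc_auc_score` computes the same quantity for binary labels.

```python
def roc_auc(y_true, y_score):
    """Area under the ROC curve for binary labels (0/1).

    Uses the rank-sum identity: AUC is the probability that a randomly
    chosen positive gets a higher score than a randomly chosen negative,
    with ties counted as one half.
    """
    pairs = sorted(zip(y_score, y_true))
    n = len(pairs)

    # Assign 1-based ranks, giving tied scores their average rank.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined when only one class is present")

    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Illustrative data: 1 = relevant item, 0 = not relevant.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes, which makes AUC independent of any particular decision threshold.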

juan-yu commented 6 months ago

@miguelgfierro the reason for adding the million songs code is to read the .h5 files: they have a special format, and using the getters provided by the dataset's creator is more convenient. The code is Python 2 and very old, but it only needs slight edits before we can use it, which is faster than writing our own getters from scratch.

miguelgfierro commented 6 months ago

@lgljht90 have you tried h5py?

juan-yu commented 6 months ago

> @lgljht90 have you tried h5py?

Yes, I tried using h5py to iterate over the rows, but this dataset seems to have a special structure, different from normal .h5 data, so I ended up using the creator's getters. I consulted someone from the team that used this dataset last year, and he advised that it is not worth spending time on the structure of the million songs dataset. Professor, may I ask why you suggest writing our own getters from scratch? Could it improve performance? Reading is really slow right now.

Thank you!
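
On the h5py question raised in this thread, one way to see what structure an .h5 file has is to list every dataset with its shape and dtype before deciding how to read it. The sketch below is only an illustration: `walk_datasets` and `describe_h5_file` are hypothetical helper names, and the group names in the usage example are stand-ins rather than a statement about how the million songs files are actually laid out. It relies only on `h5py.File` and the `items()` method of h5py groups.

```python
def walk_datasets(group, prefix=""):
    """Yield (path, shape, dtype) for every dataset below `group`.

    `group` can be an h5py.File or h5py.Group. Plain nested dicts also
    work, which is convenient for trying the function without a file.
    """
    for name, obj in group.items():
        path = f"{prefix}/{name}"
        if hasattr(obj, "items"):
            # Group-like node: recurse into its children.
            yield from walk_datasets(obj, path)
        else:
            # Dataset-like leaf: report its shape and dtype if it has them.
            yield path, getattr(obj, "shape", None), getattr(obj, "dtype", None)


def describe_h5_file(filename):
    """Open an HDF5 file read-only and list all of its datasets."""
    import h5py  # third-party dependency, imported lazily

    with h5py.File(filename, "r") as f:
        return list(walk_datasets(f))


# Stand-in structure (assumed names, not taken from a real file).
example = {"metadata": {"songs": [1, 2, 3]}, "analysis": {"songs": [4]}}
for path, shape, dtype in walk_datasets(example):
    print(path, shape, dtype)
```

Once the layout is known, a whole table can usually be read in one call instead of row by row, which may help with the slow reads mentioned above; whether it does for this dataset would need to be measured.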

riwanahas / runreco

Midterm feedback #29