save_weights fails when large number of input features are present

instacart / lore

Lore makes machine learning approachable for Software Engineers and maintainable for Machine Learning Researchers

MIT License


save_weights fails when large number of input features are present #95

Open Guzzii opened 6 years ago

Guzzii commented 6 years ago

Hi @montanalow. This is really great work. I like how you abstract away the common pitfalls in machine learning and streamline the process in this project, and I see a lot of potential in it from a data scientist's perspective. If you don't mind, I can provide feedback from using this tool.

For this particular issue, I encountered an h5py error because of too many Input layers. As shown here, we have to pass one encoder for each column in the dataframe, and each encoder corresponds to one Input layer. I deal with a lot of DNA sequence data, which usually has more than 5000 columns. I think it makes sense to at least combine the columns that use Continuous or Pass encoders into one Input.

montanalow commented 6 years ago

@Guzzii This is something we've run into internally as well. The current workaround is to set short_names = True, which will get you to hundreds, but probably not thousands, of inputs.

What if encoders that share a common base name, followed by a number, e.g. 'sequence_1', 'sequence_2', 'sequence_3', ... 'sequence_n' were mapped into a single input of 'sequence' with shape(n), for all types where that is possible?

Guzzii commented 6 years ago

Hi montanalow. I think that makes sense. I just want to make sure I understand correctly. In this case, it would aggregate columns with a shared base name like sequence_col_{}, and encoder-generated inputs like one_hot_{}, into one input each:

sequence_col_{} -> sequence (input_shape=n_1)
one_hot_{} -> one_hot (input_shape=n_2)
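
A minimal sketch of that grouping step, in plain Python. The helper name group_by_base_name and the example column names are hypothetical and not part of Lore's API; it only assumes that grouped columns end in an underscore followed by a number, as in the proposal above.

```python
import re
from collections import OrderedDict

# Hypothetical helper for illustration only; not part of Lore.
# Matches names such as 'sequence_col_12' -> base 'sequence_col', index 12.
_SUFFIX = re.compile(r'^(?P<base>.+)_(?P<index>\d+)$')


def group_by_base_name(names):
    """Group column names that share a base name followed by a number.

    Returns an ordered mapping of base name -> member names, with the
    members sorted by their numeric suffix. Names without a numeric
    suffix form a group of their own.
    """
    groups = OrderedDict()
    for name in names:
        match = _SUFFIX.match(name)
        if match:
            base, index = match.group('base'), int(match.group('index'))
        else:
            base, index = name, 0
        groups.setdefault(base, []).append((index, name))
    return OrderedDict(
        (base, [name for _, name in sorted(members)])
        for base, members in groups.items()
    )


columns = ['sequence_col_1', 'sequence_col_10', 'sequence_col_2',
           'one_hot_1', 'one_hot_2', 'age']
grouped = group_by_base_name(columns)
# One input per base name, with input_shape equal to the group size.
input_shapes = {base: (len(members),) for base, members in grouped.items()}
```

Sorting by the numeric suffix rather than by the string keeps sequence_col_10 after sequence_col_2, so positions in the merged input stay in column order. Note that a plain column named age would be merged with age_1 under this rule, so a real implementation would probably need an explicit opt-in.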

montanalow commented 6 years ago

Correct. I think there will be a little bit of complexity around encoders that have a sequence_length, like the Token encoder, because they will need to go to a 2D-shaped input, but it should still work in theory.
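
To make the shape question concrete, here is a rough sketch of how the merged input shape might be chosen. The function grouped_input_shape is hypothetical, and treating sequence_length as an optional integer is an assumption about how such encoders would be handled, not Lore's actual behavior.

```python
def grouped_input_shape(n_columns, sequence_length=None):
    """Return the shape of a single input that merges n_columns encoders.

    Hypothetical helper: encoders without a sequence_length collapse to a
    1D input of shape (n,); encoders with one need a 2D input of shape
    (n, sequence_length).
    """
    if sequence_length is None:
        return (n_columns,)
    return (n_columns, sequence_length)


# 5000 continuous columns merged into one input.
continuous_shape = grouped_input_shape(5000)
# 3 token columns, each encoded as a sequence of 20 tokens.
token_shape = grouped_input_shape(3, sequence_length=20)
```

This assumes every encoder in a group shares the same sequence_length; groups with mixed lengths would need padding or separate inputs.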

metatron1973 commented 2 years ago

Fraud inside this system