Open Guzzii opened 6 years ago
@Guzzii This is something we've run into internally as well. The current workaround is to set `short_names = True`, which will get you to hundreds, but probably not thousands, of inputs.
What if encoders that share a common base name followed by a number, e.g. `sequence_1`, `sequence_2`, `sequence_3`, ... `sequence_n`, were mapped into a single input `sequence` with shape `(n)`, for all types where that is possible?
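A rough sketch of the grouping step being proposed (the helper name `group_by_base` is hypothetical, not part of lore):

```python
import re
from collections import defaultdict

def group_by_base(columns):
    """Group column names like 'sequence_1', 'sequence_2' under 'sequence'.

    Names without a trailing _<number> stay in their own group.
    """
    groups = defaultdict(list)
    for name in columns:
        match = re.fullmatch(r'(.+)_(\d+)', name)
        base = match.group(1) if match else name
        groups[base].append(name)
    return dict(groups)

cols = ['sequence_1', 'sequence_2', 'sequence_3', 'other']
groups = group_by_base(cols)
# 'sequence_1'..'sequence_3' collapse into one group, so a single
# input of shape (3,) could replace three separate Input layers.
print(groups)
```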
Hi @montanalow. I think it makes sense. Just want to make sure I understand correctly: in this case, it would aggregate columns with a shared base name like `sequence_col_{}`, and encoder-generated inputs like `one_hot_{}`, respectively:
`sequence_col_{} -> sequence (input_shape=n_1)`
`one_hot_{} -> one_hot (input_shape=n_2)`
Correct. I think there will be a little bit of complexity around encoders that have a `sequence_length`, like the `Token` encoder, because they will need to go to a 2D-shaped input, but it should still work in theory.
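To illustrate the 2D case: if each of `n` token-encoded columns produces a fixed-length sequence, the combined input becomes a 2D array of shape `(n, sequence_length)`. A minimal sketch with made-up data (nothing here is lore's actual API):

```python
import numpy as np

# Hypothetical: 3 token-encoded columns, each yielding 5 token ids.
sequence_length = 5
encoded_columns = [np.arange(sequence_length) + i for i in range(3)]

# Stacking yields one 2D input of shape (n_columns, sequence_length),
# instead of 3 separate 1D Input layers.
combined = np.stack(encoded_columns)
print(combined.shape)
```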
Hi @montanalow. This is really great work. I really like how you abstract away the common pitfalls in machine learning and streamline the process in this project. I see a lot of potential in this project from a data scientist's perspective. If you don't mind, I can provide my feedback from using this tool.
For this particular issue, I encountered an `h5py` error because of too many `Input` layers. As shown here, we have to pass one encoder for each column in the dataframe, and each encoder corresponds to one `Input` layer. I deal with a lot of DNA sequence data, which usually has >5000 columns. I think it makes sense to at least combine the columns using `Continuous` or `Pass` encoders into one `Input`.
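A sketch of what that combination would look like at the data level (column names and counts are made up for illustration): thousands of continuous columns concatenated into a single matrix feeding one input of shape `(n_cols,)`, rather than one `Input` layer per column.

```python
import numpy as np

# Hypothetical: 5000 continuous feature columns for 10 rows.
n_rows, n_cols = 10, 5000
columns = {f'feature_{i}': np.random.rand(n_rows) for i in range(n_cols)}

# Instead of 5000 separate Input layers, stack the columns into one
# (n_rows, n_cols) matrix that feeds a single Input of shape (n_cols,).
combined = np.column_stack([columns[f'feature_{i}'] for i in range(n_cols)])
print(combined.shape)
```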