Alex-Arsenal opened 4 years ago
Please check seq_flow_lite/input_fn_reader.py
You can customize create_input_fn there. This is a general TensorFlow question rather than a question about the PRADO model.
You can use any dataset reader, such as tf.data.experimental.make_csv_dataset, but in the end there should be two features: the text input to the model (tf.string data type) and the label. We call the projection operation in _post_processor, which returns the ternary projection output and the sequence length that are fed to the PRADO encoder in create_model in seq_flow_lite/trainer.py.
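For illustration, a minimal sketch of what such an input function could look like (the column names "text" and "label" and the exact signature are assumptions; the real create_input_fn in seq_flow_lite also runs the projection in _post_processor):

```python
import tensorflow as tf

def create_input_fn(csv_path, batch_size, mode="train"):
  """Hypothetical input_fn sketch; adapt the column names to your CSV."""
  def input_fn(params=None):
    ds = tf.data.experimental.make_csv_dataset(
        csv_path,
        batch_size=batch_size,
        select_columns=["text", "label"],
        field_delim=",",
        use_quote_delim=True,
        num_epochs=None if mode == "train" else 1,
        shuffle=(mode == "train"))
    # Keep exactly the two tensors the model needs: the raw text (tf.string)
    # and an integer label.
    return ds.map(lambda cols: {"text": cols["text"], "label": cols["label"]})
  return input_fn
```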
You can also use tf.Print in the input function; that would let you inspect the text and the labels so you can be sure everything works.
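In TF 2 the replacement for tf.Print is tf.print; for example, a small self-contained sketch of peeking at a batch inside the input pipeline (the feature keys are the same hypothetical ones as above):

```python
import tensorflow as tf

def debug_batch(features):
  # Print a couple of texts and labels from each batch so you can verify
  # what actually reaches the model.
  tf.print("text:", features["text"][:2], "label:", features["label"][:2])
  return features

ds = tf.data.Dataset.from_tensor_slices(
    {"text": ["good movie", "bad movie"], "label": [1, 0]}).batch(2)
ds = ds.map(debug_batch)
for _ in ds:  # iterating the dataset triggers the prints
  pass
```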
Yes, I used tf.data.experimental.make_csv_dataset and passed file_path, batch_size, field_delim, and use_quote_delim into the function, and I ended up with a PrefetchDataset wrapping an OrderedDict. But that type cannot go through the random_substr function, so I just skipped it and fed the data into the trainer. The result was not useful: the model gives every example the label 0 instead of classifying them.
Besides, the input data is <PrefetchDataset shapes: OrderedDict([(uid, (None,)), (News_title_text, (None,)), (segment, (None,)), (tag_nlp, (None,))]), types: OrderedDict([(uid, tf.string), (news_title_text, tf.string), (segment, tf.string), (tag_nlp, tf.int32)])>. Is this correct?
random_substr should just process the text input. It selects a random contiguous set of words if the number of words in the sentence exceeds the max sequence length (a constant). Look at the loss: does it go down? The model is constructed with a features dict, which is a dictionary of tensors. The model just takes the text input (tf.string) and the label (tf.int32 or tf.float32). You need to set up the model/input/loss/metric correctly to train the model and observe its performance. What is provided is just demo code; you will need to customize it for your problem using the many TensorFlow calls already in it.
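As one possible sketch (not the library's own code), the extra CSV columns could be dropped with a map so that only the text and the label reach the model; the source column names below come from the PrefetchDataset printout above, and the target keys "text"/"label" are assumptions that must match whatever the trainer's features dict actually uses:

```python
def select_text_and_label(columns):
  # Keep only the text column and the integer label; drop uid and segment.
  return {"text": columns["News_title_text"], "label": columns["tag_nlp"]}

# ds is whatever tf.data.experimental.make_csv_dataset(...) returned:
# ds = ds.map(select_text_and_label)
```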
Yes, the loss goes down. But when I evaluate my examples, the eval_loss is 0 every time, and the AUC is lower than 0.7. I changed the number of steps to check the loss while training; it clearly goes down. So I do not know what is happening here. I suspect the input data type is not correct. By the way, I am a beginner, so any help would be really appreciated.
It is mysterious; I assume something is going wrong with the eval input function. Different input functions are created for train and eval.
I do not know what happened. Even if I use the training data to evaluate my model, the eval_loss still stays at 0. I checked the data I feed in. The only difference is that the demo has a string as the value in the dict, whereas tf.data.experimental.make_csv_dataset gives a list as the value. I am not sure whether this affects the data reading.
By the way, I tried the newer version to check whether this was solved, but the bazel configuration might be wrong. It complains about the BUILD file in the model folder: on the third line it cannot find //:friends.
We'll fix the issue with the BUILD file today.
Your debugging method is right; if you pursue that direction you will find the root cause.
Thanks, but how do I change the data into the correct type? I tried different approaches, but none of them worked. If you have any advice, please let me know.
Thanks.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
[yes] I am using the latest TensorFlow Model Garden release and TensorFlow 2.
[yes] I am reporting the issue to the correct repository. (Model Garden official or research directory)
[yes] I checked to make sure that this issue has not been filed already.
The entire URL of the file you are using
https://github.com/tensorflow/models/tree/master/research/sequence_projection
Describe the bug
The model does not learn anything. It just assigns label 0 to all the test data.
Steps to reproduce
I tried to run the PRADO model on my own single-label dataset. I used the tf.data.experimental.make_csv_dataset function to import my CSV file, changed the loss function from tf.nn.sparse_softmax_cross_entropy_with_logits to tf.nn.softmax_cross_entropy_with_logits, and did not set the max sequence length. But after training, the model learns nothing. I compared my input data with the demo's (tfds.load); the inner types are slightly different, but I do not know how to get my CSV file into that structure. Is there any help you can give?
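(For context, the two losses expect differently shaped labels: sparse_softmax_cross_entropy_with_logits takes integer class ids, while softmax_cross_entropy_with_logits takes one-hot vectors, so swapping them without converting the labels can silently produce a meaningless loss. A small illustrative sketch:)

```python
import tensorflow as tf

logits = tf.constant([[2.0, 0.5, -1.0]])          # one example, three classes
int_labels = tf.constant([0])                      # class ids for the sparse loss
one_hot_labels = tf.one_hot(int_labels, depth=3)   # one-hot vectors for the dense loss

sparse_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=int_labels, logits=logits)
dense_loss = tf.nn.softmax_cross_entropy_with_logits(
    labels=one_hot_labels, logits=logits)
tf.print(sparse_loss, dense_loss)  # identical only because the labels were converted
```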
Expected behavior
A well-trained model.
System information
OS Platform and Distribution: Linux Ubuntu 16.04
TensorFlow installed from (source or binary): 2.3
TensorFlow version (use command below): 2.3
Python version: 3.6
Bazel version (if compiling from source): 3.5
Also, I noticed that tfds allows me to add a custom dataset, but the instructions are too hard to follow. I do not know how to import a dataset without downloading it. If I can figure this out, it would help me get the desired structure.
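For what it's worth, a rough sketch of defining a tfds dataset over a local CSV with no download step (the class name, feature names, class count, column names, and file path are all placeholders, and the exact _split_generators style varies across tfds versions):

```python
import csv
import tensorflow_datasets as tfds

class MyCsvDataset(tfds.core.GeneratorBasedBuilder):
  """Hypothetical builder that reads a local CSV instead of downloading."""
  VERSION = tfds.core.Version("1.0.0")

  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            "text": tfds.features.Text(),
            "label": tfds.features.ClassLabel(num_classes=2),
        }),
        supervised_keys=("text", "label"))

  def _split_generators(self, dl_manager):
    # Point straight at a local file; dl_manager is unused, so nothing is downloaded.
    return {"train": self._generate_examples("/path/to/train.csv")}

  def _generate_examples(self, path):
    with open(path) as f:
      for i, row in enumerate(csv.DictReader(f)):
        yield i, {"text": row["text"], "label": int(row["label"])}
```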