Hi Kevin,

Such a great repo, thank you so much for the work. For this generative model, is there a parameter I can set to generate longer sequences rather than a single sentence? I found that the model gives partial answers when I pass it a paragraph. I tried splitting the paragraph into sentences, but that is far too slow for an API. Do you have a better idea on how to handle this?

Many thanks, Bowen
Hi Bowen, I am glad that our work has helped your research/work. Regarding the base Tk-instruct model (T5 architecture): the input and output are truncated at 512 tokens by default, so I am not sure you can pass longer sequences. Let me check whether a LongT5 architecture is compatible with the current model weights; I will get back to you on this. It would also be great if you could share a sample input sequence.
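In the meantime, here is a minimal sketch of where that 512-token cap bites, assuming the Hugging Face Tk-instruct checkpoints (the checkpoint name is illustrative, not necessarily the exact weights this repo uses):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "allenai/tk-instruct-base-def-pos"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "a paragraph-length input that may exceed 512 tokens ..."
# Tokens beyond max_length are silently dropped at this step.
batch = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
outputs = model.generate(**batch)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```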
Best, Kevin
I have a similar problem here, even with short text blobs: the output of the joint task seems to be truncated to roughly 10 tokens.
Example 1 and Example 2: (screenshots of the truncated outputs; not reproduced here)
The remaining outputs also have roughly this length (about 10 tokens). My work is in Vietnamese, and I have translated these examples into English to show the same behavior. I hope the examples above help you get more information about the problem.
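This looks consistent with the transformers default: when no cap is passed, `generate()` falls back to `max_length=20` output tokens on recent versions, which would explain the roughly 10-word outputs. A quick check:

```python
from transformers import GenerationConfig

# Default generation cap when no max_length/max_new_tokens is passed.
print(GenerationConfig().max_length)  # -> 20
```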
Many thanks for your work, Dan
Update: when generating tokens with the model, the `max_length` parameter had to be added:

```diff
- self.model.generate(batch)
+ self.model.generate(batch, max_length=128)
```

With this change, the model no longer truncates the sentences.
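For anyone landing here later, the fixed call in context (a sketch reusing the model and tokenizer from the snippet above; `max_new_tokens` is a newer alternative that caps only the generated continuation):

```python
import torch

# `batch` is the tokenized input, e.g. tokenizer(text, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **batch,
        max_length=128,        # total cap on decoder tokens (the fix above)
        # max_new_tokens=128,  # alternative: caps only newly generated tokens
    )
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```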
@dannhh Perfect, that's exactly what I needed. Thank you so much, guys!
Great, I will add an argument to make `max_length` configurable.
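A hypothetical sketch of what that could look like (the method name and default value are assumptions, not the repo's actual code):

```python
def get_predictions(self, batch, max_length=128):
    """Generate model outputs with a caller-adjustable length cap."""
    # Exposing max_length lets users request longer outputs than the default.
    return self.model.generate(batch, max_length=max_length)
```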