Theoretically it would also be possible to generate a description from e.g. a well-chosen set of keywords (i.e. the other way round). However, the ML model would then simply generate a text that is somewhat related to the keywords:
    Write a short text that summarized the interaction between the following keywords:
    clouds, sun, carbon dioxide, ocean currents, explanation

        The sun's heat evaporates water from the ocean, which then rises and forms clouds. The clouds reflect some of the sun's heat back into space, but they also trap some of the heat near the Earth. This trapped heat warms the air and land. Some of this trapped heat is then absorbed by the ocean, which affects the ocean currents. The ocean currents then help to redistribute the heat around the globe.
While technically correct, such a description would probably often miss the point of the actual content - imagine the respective content was a simulation or animation; the generated description would not cover that.
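For reference, a minimal sketch of how such a keyword-to-description query could be sent via the legacy OpenAI Completion endpoint. The engine name, API key placeholder, and sampling parameters are assumptions, not the exact settings used for the trials in this issue:

```python
# Minimal sketch: send the keyword prompt to the legacy OpenAI Completion
# endpoint. Engine name and sampling parameters are assumptions.
import openai

openai.api_key = "sk-..."  # hypothetical placeholder

prompt = (
    "Write a short text that summarized the interaction between the following keywords:\n"
    "clouds, sun, carbon dioxide, ocean currents, explanation\n"
)

response = openai.Completion.create(
    engine="davinci-instruct-beta",  # assumed instruct-series engine
    prompt=prompt,
    max_tokens=150,                  # assumed output budget
    temperature=0.7,                 # assumed sampling temperature
)

print(response["choices"][0]["text"].strip())
```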
Another very desirable feature would be to extract the description from a scraped full-text document (e.g. HTML or docx). It would be awesome if we could just dump the whole HTML body in as input and get out a summary that ignores all the technical HTML: header, footer, scripts, forms, links, etc.
Unfortunately the OpenAI models seem to be unsuited for that, as they only support inputs of up to X tokens, and the HTML content usually exceeds this limit.
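One possible workaround (not tested here) would be to strip the HTML boilerplate before sending the text to the model, so that only the visible content counts against the input limit. A rough sketch, assuming BeautifulSoup and a ~4-characters-per-token heuristic:

```python
# Rough sketch: strip scripts/markup from scraped HTML and estimate the token
# count before deciding whether the text still fits the model's input limit.
# The 4-characters-per-token heuristic is an assumption, not a measured value.
from bs4 import BeautifulSoup

def visible_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements that never contain user-facing content.
    for tag in soup(["script", "style", "header", "footer", "nav", "form"]):
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())

def rough_token_count(text: str) -> int:
    return len(text) // 4  # crude heuristic: ~4 characters per token

html = "<html><body><header>...</header><p>Actual article text ...</p></body></html>"
text = visible_text(html)
print(rough_token_count(text), "tokens (rough estimate)")
```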
There are a couple of quite powerful pre-trained NLP models available for public use.
I ran a couple of trials with OpenAI's gpt3-instruct model. This issue serves to gather the results and examples.
Single indent: text input, double indent: generated output.
Summarization to a title
Used tokens: ~150-300 :arrow_right: ~ $0.01-$0.02 per query.
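For context, this estimate is consistent with Davinci-class pricing of roughly $0.06 per 1000 tokens (an assumption about the pricing at the time, not stated in this issue):

```python
# Back-of-envelope check of the per-query cost, assuming Davinci pricing
# of $0.06 per 1000 tokens.
PRICE_PER_1K_TOKENS = 0.06  # USD, assumed

for tokens in (150, 300):
    print(f"{tokens} tokens -> ${tokens / 1000 * PRICE_PER_1K_TOKENS:.3f}")
# 150 tokens -> $0.009   (~$0.01)
# 300 tokens -> $0.018   (~$0.02)
```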
- Summarize Title #1
- Summarize Title #2
- Summarize Title #3
- Summarize Title #4