microsoft / LLMLingua

To speed up LLM inference and enhance the LLM's perception of key information, compress the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License

Retaining context metadata #80

Closed · thehapyone closed this 5 months ago

thehapyone commented 5 months ago

Hi,

I was wondering, is there a good way to retain the metadata of the original context in the final compressed output? This could be useful, for example, for citing the source of the data or for further references later on.

For example, assume the original context looks like this:

<doc id='16'>But to compete for the best jobs of the future, we also need to level the playing field with China and other competitors. \n\nThat’s why it is so important to pass the Bipartisan Innovation Act sitting in Congress that will make record investments in emerging technologies and American manufacturing. \n\nLet me give you one example of why it’s so important to pass it. \n\nIf you travel 20 miles east of Columbus, Ohio, you’ll find 1,000 empty acres of land.</doc>\n\n<doc id='17'>Get rid of outdated rules that stop doctors from prescribing treatments. And stop the flow of illicit drugs by working with state and local law enforcement to go after traffickers. \n\nIf you’re suffering from addiction, know you are not alone. I believe in recovery, and I celebrate the 23 million Americans in recovery. \n\nSecond, let’s take on mental health. Especially among our children, whose lives and education have been turned upside down.</doc>\n\n<doc id='18'>A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n\nAnd if we are to advance liberty and justice, we need to secure the Border and fix the immigration system.</doc>\n\n<doc id='19'>My plan will not only lower costs to give families a fair shot, it will lower the deficit. \n\nThe previous Administration not only ballooned the deficit with tax cuts for the very wealthy and corporations, it undermined the watchdogs whose job was to keep pandemic relief funds from being wasted. \n\nBut in my administration, the watchdogs have been welcomed back. \n\nWe’re going after the criminals who stole billions in relief money meant for small businesses and millions of Americans.</doc>"

The compressed context loses the <doc> tags, which makes it very hard to track the relevant metadata for the individual contexts.

I have attempted to use some form of hash marker as the metadata in the context. For example:

[':#ref0#: He met the Ukrainian people. \nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. \nGroups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland. \nIn this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight. :#ref0#:', ':#ref1#: And with an unwavering resolve that freedom will always triumph over tyranny. \nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \nHe met the Ukrainian people. :#ref1#:', ':#ref2#: And we remain clear-eyed. The Ukrainians are fighting back with pure courage. But the next few days weeks, months, will be hard on them.  \nPutin has unleashed violence and chaos.  But while he may make gains on the battlefield – he will pay a continuing high price over the long run. \nAnd a proud Ukrainian people, who have known 30 years  of independence, have repeatedly shown that they will not tolerate anyone who tries to take their country backwards. :#ref2#:', ':#ref3#: They keep moving.   \nAnd the costs and the threats to America and the world keep rising.   \nThat’s why the NATO Alliance was created to secure peace and stability in Europe after World War 2. \nThe United States is a member along with 29 other nations. \nIt matters. American diplomacy matters. American resolve matters. \nPutin’s latest attack on Ukraine was premeditated and unprovoked. \nHe rejected repeated efforts at diplomacy. :#ref3#:', 
...
]

The compressed prompt retains some of the added hash markers, but the behaviour is not consistent and varies a lot depending on the underlying model used. Here, for example, I have used the "openai-community/gpt2-xl" model:

[':#ref0#: He met the Ukrainian people. From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. \nGroups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland. \nIn this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight. :#ref0#:', ':#ref2#: remaineyed. The Ukrainians are pure courage. But the next few days months, will be hard them. Putin has unleashed violence and chaos.  But he may make gains on battlefield – he will pay a continuing high price over long run  proud Ukrainian people, who known 30 of independence, have that they will not tolerate who tries to take their backwards :#ref2:', ':#ref7: battle betweenocracy the moment is clearly of peace and security  a real test It’ to take time. us inspiration will people To Ukrainian a nations we stand you Putin circle with tanks, he never the and. :#ref:\nref: has. and many others, even Switzerland. \nWe are inflicting pain on Russia and supporting the people of Ukraine. Putin is now isolated from the world more than ever. \nTogether with our allies –we are right now enforcing powerful economic sanctions. :#ref9#:']

Any thoughts on this?

I'm using the "longllmlingua" as the ranking method

iofu728 commented 5 months ago

Hi @thehapyone, thanks for your support.

This is a great question. Currently, our colleague is working on implementing a feature to preserve user-specified tokens, which will soon be merged into the main branch.

For now, you can address this issue by replacing the metadata markers with "\n\n" and using the keep_split parameter. This will preserve all the "\n\n" separators after compression, so you can manually restore the original metadata after obtaining the compressed prompt.
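
Something along these lines, assuming `keep_split` preserves every "\n\n" separator as described above (so each original context keeps its slot, even if heavily compressed); the doc-id bookkeeping around the call is illustrative, not part of the library:

```python
# Sketch of the "\n\n" + keep_split workaround (the id bookkeeping is illustrative).
from llmlingua import PromptCompressor

# Keep the metadata outside the prompt, keyed by doc id.
docs = {
    "16": "But to compete for the best jobs of the future, we also need to ...",
    "17": "Get rid of outdated rules that stop doctors from prescribing treatments. ...",
    "18": "A former top litigator in private practice. A former federal public defender. ...",
    "19": "My plan will not only lower costs to give families a fair shot, ...",
}
doc_ids = list(docs.keys())

compressor = PromptCompressor(model_name="openai-community/gpt2-xl", device_map="cpu")

# Pass the raw contexts (metadata stripped); keep_split preserves the "\n\n"
# separators between them in the compressed output.
result = compressor.compress_prompt(
    list(docs.values()),
    rank_method="longllmlingua",
    target_token=300,
    keep_split=True,
)

# Split the compressed prompt back on "\n\n" and re-attach the original ids.
# This assumes every context keeps its "\n\n"-delimited slot (possibly empty).
segments = result["compressed_prompt"].split("\n\n")
restored = [f"<doc id='{doc_id}'>{seg}</doc>" for doc_id, seg in zip(doc_ids, segments)]
print("\n\n".join(restored))
```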

eav-solution commented 3 months ago

Hello, I am facing this issue as well. I have several contexts, but some of them are filtered out by LLMLingua. How can I determine which context ids survived, so that I can manually restore the original metadata?