poloniki / quint

Transcription/Chunking/Summarization of audio content.
MIT License
54 stars, 19 forks

Loss of words when chunking text into paragraphs #3

Open km5ar opened 1 year ago

km5ar commented 1 year ago

Hi, there seems to be a loss of words/sentences when chunking text into paragraphs. Any suggestions on how to solve it?

poloniki commented 1 year ago

Good day! Could you please elaborate a little more so that I can help? Can you give an example?

km5ar commented 1 year ago

Hi, I think the following code has some issues. Sometimes longer sentences are filtered out after passing through it, but I'm not able to identify where it goes wrong:

# Get the length of each sentence
sentece_length = [len(each) for each in sentences]
# Determine longest outlier
long = np.mean(sentece_length) + np.std(sentece_length) *2
# Determine shortest outlier
short = np.mean(sentece_length) - np.std(sentece_length) *2
# Shorten long sentences
text = ''
for each in sentences:
    if len(each) > long:
        # let's replace all the commas with dots
        comma_splitted = each.replace(',', '.')
    else:
        text+= f'{each}. '
sentences = text.split('. ')
# Now let's concatenate short ones
text = ''
for each in sentences:
    if len(each) < short:
        text+= f'{each} '
    else:
        text+= f'{each}. '
km5ar commented 1 year ago

I'm not able to identify why this happens, but many of the long sentences go missing.

For example, if you use the following paragraphs:

text = """
Over the course of his life, Jefferson owned more than 600 slaves. Since Jefferson's time, controversy has revolved around his relationship with Sally Hemings, a mixed-race enslaved woman and his late wife's half-sister.[13] According to 1998 DNA testing of Jefferson's and Hemings' descendants, combined with documentary and statistical evidence and oral history, Jefferson fathered at least six children with Hemings, including four that survived to adulthood.[14] Evidence suggests that Jefferson started the relationship with Hemings when they were in Paris, some time after she arrived there at the age of 14 or 15, when Jefferson was 44. By the time she returned to the United States at 16 or 17, she was pregnant.[15]

After retiring from public office, Jefferson founded the University of Virginia. He and John Adams both died on July 4, 1826, the 50th anniversary of U.S. independence. Presidential scholars and historians generally praise Jefferson's public achievements, including his advocacy of religious freedom and tolerance in Virginia, his peaceful acquisition of the Louisiana Territory from France without war or controversy, and his ambitious and successful Lewis and Clark Expedition. Some modern historians are critical of Jefferson's personal involvement with slavery. Jefferson is consistently ranked in the top ten presidents of American history."""

you will receive the following output:

Over the course of his life, Jefferson owned more than 600 slaves.. Since Jefferson's time, controversy has revolved around his relationship with Sally Hemings, a mixed-race enslaved woman and his late wife's half-sister.. [13] According to 1998 DNA testing of Jefferson's and Hemings' descendants, combined with documentary and statistical evidence and oral history, Jefferson fathered at least six children with Hemings, including four that survived to adulthood.. [14] Evidence suggests that Jefferson started the relationship with Hemings when they were in Paris, some time after she arrived there at the age of 14 or 15, when Jefferson was 44.. By the time she returned to the United States at 16 or 17, she was pregnant.. [15]
After retiring from public office, Jefferson founded the University of Virginia.. He and John Adams both died on July 4, 1826, the 50th anniversary of U.S. independence.. Some modern historians are critical of Jefferson's personal involvement with slavery.. Jefferson is consistently ranked in the top ten presidents of American history.. . 

where the following sentence has been lost:

Presidential scholars and historians generally praise Jefferson's public achievements, including his advocacy of religious freedom and tolerance in Virginia, his peaceful acquisition of the Louisiana Territory from France without war or controversy, and his ambitious and successful Lewis and Clark Expedition.

km5ar commented 1 year ago

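I think adding "text+= f'{comma_splitted}. '" to the long-sentence branch will fix the issue, because that branch currently builds comma_splitted but never appends it to text: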

for each in sentences:
    if len(each) > long:
        # let's replace all the commas with dots
        comma_splitted = each.replace(',', '.')
        text+= f'{comma_splitted}. '  # added: append the split sentence so it is not dropped
    else:
        text+= f'{each}. '
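
For anyone who wants to try the change in isolation, here is a minimal self-contained sketch of the loop above with that line added. The function name split_long_sentences and the sample input are hypothetical, made up for illustration. It uses the standard library's statistics.mean and statistics.pstdev instead of np.mean and np.std so it runs without NumPy (pstdev is the population standard deviation, which is what np.std computes by default). It assumes sentences is already a list of strings and does not address the doubled periods visible in the output above.

```python
from statistics import mean, pstdev


def split_long_sentences(sentences):
    """Rejoin sentences into one string, breaking unusually long ones at commas.

    Hypothetical helper mirroring the loop quoted in this issue, with the
    long-sentence branch appending its result instead of discarding it.
    """
    sentence_length = [len(each) for each in sentences]
    # Anything longer than mean + 2 standard deviations counts as "long"
    long_cutoff = mean(sentence_length) + 2 * pstdev(sentence_length)

    text = ''
    for each in sentences:
        if len(each) > long_cutoff:
            # Replace commas with dots so the long sentence splits later on
            comma_splitted = each.replace(',', '.')
            text += f'{comma_splitted}. '  # the line missing in the original
        else:
            text += f'{each}. '
    return text


# Made-up input: six short sentences and one long outlier
sentences = ['short'] * 6 + ['one, two, three, four, five, six, seven']
print(split_long_sentences(sentences))
```

With this input the long sentence stays in the result, with its commas turned into dots, rather than disappearing.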