simnalamburt / rust-pragmatic-segmenter

🗣️ Rust port of pySBD and pragmatic-segmenter

Big bug: a majority of the text is lost #8

Open adri1wald opened 1 year ago

adri1wald commented 1 year ago
use pragmatic_segmenter::Segmenter;

fn main() {
    let segmenter = match Segmenter::new() {
        Ok(segmenter) => segmenter,
        Err(err) => panic!("Error creating segmenter: {}", err),
    };
    let text = "In Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. It is sometimes used to refer to a group of four plays (a tetralogy), but some sources and scholars use the term to refer to eight plays. In the 19th century, Algernon Charles Swinburne used the term to refer to three plays, but that use is not current. In one sense, Henriad refers to: Richard II; Henry IV, Part 1; Henry IV, Part 2; and Henry V — with the implication that these four plays are Shakespeare's epic, and that Prince Harry, who later becomes Henry V, is the epic hero. (This group may also be referred to as the \"second tetralogy\" or \"second Henriad\".)[1][2] In a more inclusive meaning, Henriad refers to eight plays: the tetralogy mentioned above (Richard II, Henry IV, Part 1, Henry IV, Part 2, and Henry V), plus four plays that were written earlier, and are based on the civil wars now known as The Wars of the Roses — Henry VI, Part 1, Henry VI, Part 2, Henry VI, Part 3 and Richard III. In Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. It is sometimes used to refer to a group of four plays (a tetralogy), but some sources and scholars use the term to refer to eight plays. In the 19th century, Algernon Charles Swinburne used the term to refer to three plays, but that use is not current. In one sense, Henriad refers to: Richard II; Henry IV, Part 1; Henry IV, Part 2; and Henry V — with the implication that these four plays are Shakespeare's epic, and that Prince Harry, who later becomes Henry V, is the epic hero. (This group may also be referred to as the \"second tetralogy\" or \"second Henriad\".) 
In a more inclusive meaning, Henriad refers to eight plays: the tetralogy mentioned above (Richard II, Henry IV, Part 1, Henry IV, Part 2, and Henry V), plus four plays that were written earlier, and are based on the civil wars now known as The Wars of the Roses — Henry VI, Part 1, Henry VI, Part 2, Henry VI, Part 3 and Richard III. In Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. It is sometimes used to refer to a group of four plays (a tetralogy), but some sources and scholars use the term to refer to eight plays. In the 19th century, Algernon Charles Swinburne used the term to refer to three plays, but that use is not current. In one sense, Henriad refers to: Richard II; Henry IV, Part 1; Henry IV, Part 2; and Henry V — with the implication that these four plays are Shakespeare's epic, and that Prince Harry, who later becomes Henry V, is the epic hero. (This group may also be referred to as the \"second tetralogy\" or \"second Henriad\".)[1][2] In a more inclusive meaning, Henriad refers to eight plays: the tetralogy mentioned above (Richard II, Henry IV, Part 1, Henry IV, Part 2, and Henry V), plus four plays that were written earlier, and are based on the civil wars now known as The Wars of the Roses — Henry VI, Part 1, Henry VI, Part 2, Henry VI, Part 3 and Richard III.[3] In Shakespearean scholarship, Henriad refers to a group of Wil Henriad refers to a group of Wil Henriad refers to a group of Wil Henriad refers to a group of Wil Henriad refers\nIn Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. It is sometimes used to refer to a group of four plays (a tetralogy), but some sources and scholars use the term to refer to eight plays. In the 19th century, Algernon Charles Swinburne used the term to refer to three plays, but that use is not current. 
In one sense, Henriad refers to: Richard II; Henry IV, Part 1; Henry IV, Part 2; and Henry V — with the implication that these four plays are Shakespeare's epic, and that Prince Harry, who later becomes Henry V, is the epic hero. (This group may also be referred to as the \"second tetralogy\" or \"second Henriad\".)[1][2] In a more inclusive meaning, Henriad refers to eight plays: the tetralogy mentioned above (Richard II, Henry IV, Part 1, Henry IV, Part 2, and Henry V), plus four plays that were written earlier, and are based on the civil wars now known as The Wars of the Roses — Henry VI, Part 1, Henry VI, Part 2, Henry VI, Part 3 and Richard III. In Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. It is sometimes used to refer to a group of four plays (a tetralogy), but some sources and scholars use the term to refer to eight plays. In the 19th century, Algernon Charles Swinburne used the term to refer to three plays, but that use is not current. In one sense, Henriad refers to: Richard II; Henry IV, Part 1; Henry IV, Part 2; and Henry V — with the implication that these four plays are Shakespeare's epic, and that Prince Harry, who later becomes Henry V, is the epic hero. (This group may also be referred to as the \"second tetralogy\" or \"second Henriad\".) In a more inclusive meaning, Henriad refers to eight plays: the tetralogy mentioned above (Richard II, Henry IV, Part 1, Henry IV, Part 2, and Henry V), plus four plays that were written earlier, and are based on the civil wars now known as The Wars of the Roses — Henry VI, Part 1, Henry VI, Part 2, Henry VI, Part 3 and Richard III. In Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. It is sometimes used to refer to a group of four plays (a tetralogy), but some sources and scholars use the term to refer to eight plays. 
In the 19th century, Algernon Charles Swinburne used the term to refer to three plays, but that use is not current. In one sense, Henriad refers to: Richard II; Henry IV, Part 1; Henry IV, Part 2; and Henry V — with the implication that these four plays are Shakespeare's epic, and that Prince Harry, who later becomes Henry V, is the epic hero. (This group may also be referred to as the \"second tetralogy\" or \"second Henriad\".)[1][2] In a more inclusive meaning, Henriad refers to eight plays: the tetralogy mentioned above (Richard II, Henry IV, Part 1, Henry IV, Part 2, and Henry V), plus four plays that were written earlier, and are based on the civil wars now known as The Wars of the Roses — Henry VI, Part 1, Henry VI, Part 2, Henry VI, Part 3 and Richard III.[3] In Shakespearean scholarship, Henriad refers to a group of Wil Henriad refers to a group of Wil Henriad refers to a group of Wil Henriad refers to a group of Wil Henriad refers\nIn Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. It is sometimes used to refer to a group of four plays (a tetralogy), but some sources and scholars use the term to refer to eight plays. In the 19th century, Algernon Charles Swinburne used the term to refer to three plays, but that use is not current. In one sense, Henriad refers to: Richard II; Henry IV, Part 1; Henry IV, Part 2; and Henry V — with the implication that these four plays are Shakespeare's epic, and that Prince Harry, who later becomes Henry V, is the epic hero. (This group may also be referred to as the \"second tetralogy\" or \"second Henriad\".)[1][2] In a more inclusive meaning, Henriad refers to eight plays: the tetralogy mentioned above (Richard II, Henry IV, Part 1, Henry IV, Part 2, and Henry V), plus four plays that were written earlier, and are based on the civil wars now known as The Wars of the Roses — Henry VI, Part 1, Henry VI, Part 2, Henry VI, Part 3 and Richard III. 
In Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. It is sometimes used to refer to a group of four plays (a tetralogy), but some sources and scholars use the term to refer to eight plays. In the 19th century, Algernon Charles Swinburne used the term to refer to three plays, but that use is not current. In one sense, Henriad refers to: Richard II; Henry IV, Part 1; Henry IV, Part 2; and Henry V — with the implication that these four plays are Shakespeare's epic, and that Prince Harry, who later becomes Henry V, is the epic hero. (This group may also be referred to as the \"second tetralogy\" or \"second Henriad\".) In a more inclusive meaning, Henriad refers to eight plays: the tetralogy mentioned above (Richard II, Henry IV, Part 1, Henry IV, Part 2, and Henry V), plus four plays that were written earlier, and are based on the civil wars now known as The Wars of the Roses — Henry VI, Part 1, Henry VI, Part 2, Henry VI, Part 3 and Richard III. In Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. It is sometimes used to refer to a group of four plays (a tetralogy), but some sources and scholars use the term to refer to eight plays. In the 19th century, Algernon Charles Swinburne used the term to refer to three plays, but that use is not current. In one sense, Henriad refers to: Richard II; Henry IV, Part 1; Henry IV, Part 2; and Henry V — with the implication that these four plays are Shakespeare's epic, and that Prince Harry, who later becomes Henry V, is the epic hero. 
(This group may also be referred to as the \"second tetralogy\" or \"second Henriad\".)[1][2] In a more inclusive meaning, Henriad refers to eight plays: the tetralogy mentioned above (Richard II, Henry IV, Part 1, Henry IV, Part 2, and Henry V), plus four plays that were written earlier, and are based on the civil wars now known as The Wars of the Roses — Henry VI, Part 1, Henry VI, Part 2, Henry VI, Part 3 and Richard III.[3] In Shakespearean scholarship, Henriad refers to a group of Wil Henriad refers to a group of Wil Henriad refers to a group of Wil Henriad refers to a group of Wil Henriad refers\nIn Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. It is sometimes used to refer to a group of four plays (a tetralogy), but some sources and scholars use the term to refer to eight plays. In the 19th century, Algernon Charles Swinburne used the term to refer to three plays, but that use is not current. In one sense, Henriad refers to: Richard II; Henry IV, Part 1; Henry IV, Part 2; and Henry V — with the implication that these four plays are Shakespeare's epic, and that Prince Harry, who later becomes Henry V, is the epic hero. (This group may also be referred to as the \"second tetralogy\" or \"second Henriad\".)[1][2] In a more inclusive meaning, Henriad refers to eight plays: the tetralogy mentioned above (Richard II, Henry IV, Part 1, Henry IV, Part 2, and Henry V), plus four plays that were written earlier, and are based on the civil wars now known as The Wars of the Roses — Henry VI, Part 1, Henry VI, Part 2, Henry VI, Part 3 and Richard III. In Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. It is sometimes used to refer to a group of four plays (a tetralogy), but some sources and scholars use the term to refer to eight plays. In the 19th century, Algernon Charles Swinburne used the term to refer to three plays, but that use is not current. 
In one sense, Henriad refers to: Richard II; Henry IV, Part 1; Henry IV, Part 2; and Henry V — with the implication that these four plays are Shakespeare's epic, and that Prince Harry, who later becomes Henry V, is the epic hero. (This group may also be referred to as the \"second tetralogy\" or \"second Henriad\".) In a more inclusive meaning, Henriad refers to eight plays: the tetralogy mentioned above (Richard II, Henry IV, Part 1, Henry IV, Part 2, and Henry V), plus four plays that were written earlier, and are based on the civil wars now known as The Wars of the Roses — Henry VI, Part 1, Henry VI, Part 2, Henry VI, Part 3 and Richard III. In Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. It is sometimes used to refer to a group of four plays (a tetralogy), but some sources and scholars use the term to refer to eight plays. In the 19th century, Algernon Charles Swinburne used the term to refer to three plays, but that use is not current. In one sense, Henriad refers to: Richard II; Henry IV, Part 1; Henry IV, Part 2; and Henry V — with the implication that these four plays are Shakespeare's epic, and that Prince Harry, who later becomes Henry V, is the epic hero. (This group may also be referred to as the \"second tetralogy\" or \"second Henriad\".)[1][2] In a more inclusive meaning, Henriad refers to eight plays: the tetralogy mentioned above (Richard II, Henry IV, Part 1, Henry IV, Part 2, and Henry V), plus four plays that were written earlier, and are based on the civil wars now known as The Wars of the Roses — Henry VI, Part 1, Henry VI, Part 2, Henry VI, Part 3 and Richard III.[3] In Shakespearean scholarship, Henriad refers to a group of Wil Henriad refers to a group of Wil Henriad refers to a group of Wil Henriad refers to a group of Wil Henriad refers\nIn Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. 
It is sometimes used to refer to a group of four plays (a tetralogy), but some sources and scholars use the term to refer to eight plays. In the 19th century, Algernon Charles Swinburne used the term to refer to three plays, but that use is not current. In one sense, Henriad refers to: Richard II; Henry IV, Part 1; Henry IV, Part 2; and Henry V — with the implication that these four plays are Shakespeare's epic, and that Prince Harry, who later becomes Henry V, is the epic hero. (This group may also be referred to as the \"second tetralogy\" or \"second Henriad\".)[1][2] In a more inclusive meaning, Henriad refers to eight plays: the tetralogy mentioned above (Richard II, Henry IV, Part 1, Henry IV, Part 2, and Henry V), plus four plays that were written earlier, and are based on the civil wars now known as The Wars of the Roses — Henry VI, Part 1, Henry VI, Part 2, Henry VI, Part 3 and Richard III. In Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. It is sometimes used to refer to a group of four plays (a tetralogy), but some sources and scholars use the term to refer to eight plays. In the 19th century, Algernon Charles Swinburne used the term to refer to three plays, but that use is not current. In one sense, Henriad refers to: Richard II; Henry IV, Part 1; Henry IV, Part 2; and Henry V — with the implication that these four plays are Shakespeare's epic, and that Prince Harry, who later becomes Henry V, is the epic hero. (This group may also be referred to as the \"second tetralogy\" or \"second Henriad\".) In a more inclusive meaning, Henriad refers to eight plays: the tetralogy mentioned above (Richard II, Henry IV, Part 1, Henry IV, Part 2, and Henry V), plus four plays that were written earlier, and are based on the civil wars now known as The Wars of the Roses — Henry VI, Part 1, Henry VI, Part 2, Henry VI, Part 3 and Richard III. 
In Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. It is sometimes used to refer to a group of four plays (a tetralogy), but some sources and scholars use the term to refer to eight plays. In the 19th century, Algernon Charles Swinburne used the term to refer to three plays, but that use is not current. In one sense, Henriad refers to: Richard II; Henry IV, Part 1; Henry IV, Part 2; and Henry V — with the implication that these four plays are Shakespeare's epic, and that Prince Harry, who later becomes Henry V, is the epic hero. (This group may also be referred to as the \"second tetralogy\" or \"second Henriad\".)[1][2] In a more inclusive meaning, Henriad refers to eight plays: the tetralogy mentioned above (Richard II, Henry IV, Part 1, Henry IV, Part 2, and Henry V), plus four plays that were written earlier, and are based on the civil wars now known as The Wars of the Roses — Henry VI, Part 1, Henry VI, Part 2, Henry VI, Part 3 and Richard III.[3] In Shakespearean scholarship, Henriad refers to a group of Wil Henriad refers to a group of Wil Henriad refers to a group of Wil Henriad refers to a group of Wil Henriad refers";
    let sentences = segmenter.segment(text).collect::<Vec<_>>();
    // print each sentence on its own line
    for sentence in sentences {
        println!("{}", sentence);
    }
}

outputs

In Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. 
In Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. 
In Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. 
In Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. 
In Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. 
In Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. 
In Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. 
In Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. 
In Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. 
In Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. 
In Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. 
In Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. 
In Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. 
In Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. 
In Shakespearean scholarship, Henriad refers to a group of William Shakespeare's history plays. 
It is sometimes used to refer to a group of four plays (a tetralogy), but some sources and scholars use the term to refer to eight plays. 
In the 19th century, Algernon Charles Swinburne used the term to refer to three plays, but that use is not current. 
In one sense, Henriad refers to: Richard II; Henry IV, Part 1; Henry IV, Part 2; and Henry V — with the implication that these four plays are Shakespeare's epic, and that Prince Harry, who later becomes Henry V, is the epic hero. 
(This group may also be referred to as the "second tetralogy" or "second Henriad".)[1][2] In a more inclusive meaning, Henriad refers to eight plays: the tetralogy mentioned above (Richard II, Henry IV, Part 1, Henry IV, Part 2, and Henry V), plus four plays that were written earlier, and are based on the civil wars now known as The Wars of the Roses — Henry VI, Part 1, Henry VI, Part 2, Henry VI, Part 3 and Richard III.
(This group may also be referred to as the "second tetralogy" or "second Henriad".)
In a more inclusive meaning, Henriad refers to eight plays: the tetralogy mentioned above (Richard II, Henry IV, Part 1, Henry IV, Part 2, and Henry V), plus four plays that were written earlier, and are based on the civil wars now known as The Wars of the Roses — Henry VI, Part 1, Henry VI, Part 2, Henry VI, Part 3 and Richard III.
In a more inclusive meaning, Henriad refers to eight plays: the tetralogy mentioned above (Richard II, Henry IV, Part 1, Henry IV, Part 2, and Henry V), plus four plays that were written earlier, and are based on the civil wars now known as The Wars of the Roses — Henry VI, Part 1, Henry VI, Part 2, Henry VI, Part 3 and Richard III.
In a more inclusive meaning, Henriad refers to eight plays: the tetralogy mentioned above (Richard II, Henry IV, Part 1, Henry IV, Part 2, and Henry V), plus four plays that were written earlier, and are based on the civil wars now known as The Wars of the Roses — Henry VI, Part 1, Henry VI, Part 2, Henry VI, Part 3 and Richard III.
In a more inclusive meaning, Henriad refers to eight plays: the tetralogy mentioned above (Richard II, Henry IV, Part 1, Henry IV, Part 2, and Henry V), plus four plays that were written earlier, and are based on the civil wars now known as The Wars of the Roses — Henry VI, Part 1, Henry VI, Part 2, Henry VI, Part 3 and Richard III.
In a more inclusive meaning, Henriad refers to eight plays: the tetralogy mentioned above (Richard II, Henry IV, Part 1, Henry IV, Part 2, and Henry V), plus four plays that were written earlier, and are based on the civil wars now known as The Wars of the Roses — Henry VI, Part 1, Henry VI, Part 2, Henry VI, Part 3 and Richard III.

which is clearly very wrong: the first sentence is repeated 15 times, and most of the remaining input is missing.
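To make "lost text" checkable rather than eyeballed, here is a rough sketch of a coverage check. `lost_spans` is a hypothetical helper written for this report, not part of `pragmatic_segmenter`. It assumes every segment should be a verbatim substring of the input and that segments come back in input order; whitespace between segments is ignored.

```rust
/// Returns the non-whitespace parts of `text` that no segment covers.
/// Hypothetical helper: assumes segments are verbatim, in-order substrings of `text`.
fn lost_spans<'a>(text: &'a str, segments: &[&str]) -> Vec<&'a str> {
    let mut lost = Vec::new();
    let mut cursor = 0; // byte offset of the first character not yet covered

    for seg in segments {
        let seg = seg.trim();
        if seg.is_empty() {
            continue;
        }
        // Look for the segment at or after the cursor; anything skipped over is lost.
        if let Some(offset) = text[cursor..].find(seg) {
            let gap = text[cursor..cursor + offset].trim();
            if !gap.is_empty() {
                lost.push(gap);
            }
            cursor += offset + seg.len();
        }
        // A segment that cannot be found (duplicated or altered) is skipped here.
    }

    // Whatever remains after the last matched segment is lost as well.
    let tail = text[cursor..].trim();
    if !tail.is_empty() {
        lost.push(tail);
    }
    lost
}

fn main() {
    let text = "One. Two. Three. Four.";
    // Pretend the segmenter only returned two of the four sentences.
    let segments = ["One. ", "Three. "];
    for span in lost_spans(text, &segments) {
        println!("lost: {:?}", span);
    }
}
```

With the toy input above this reports `"Two."` and `"Four."` as lost. Passing the input `text` and the collected `sentences` from the reproduction should list the dropped sentences the same way, assuming the segmenter yields string slices or values that can be borrowed as `&str`.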

simnalamburt commented 1 year ago

Thanks for your report. This is clearly a bug, and the output doesn't match the behavior of pysbd v0.3.1. I was able to reproduce the issue at https://github.com/simnalamburt/rust-pragmatic-segmenter/commit/f0a8bafe980d92c1ac9c33a61f0b24fbf4eef392.

Unfortunately, I am no longer actively maintaining this project, so I can't promise when it will be fixed.