mhueppe / machineLearningProject_jaNoMi

This is a public Repository to manage the Machine Learning Project for WS 2024/25.

Discuss different encoding techniques. #20

Open mhueppe opened 6 hours ago

mhueppe commented 6 hours ago

Choosing between character-based language models and n-gram/word-based language models depends on the specific task, dataset, and requirements of your application. Here’s a detailed comparison of the two approaches to help determine which is better for your needs:


1. Character-Based Language Models

Character-based models treat individual characters as the fundamental unit of input and output.

Advantages:

- Very small, fixed vocabulary, so there is effectively no out-of-vocabulary (OOV) problem: any word can be represented, and unseen words can even be generated.
- Robust to typos, rare words, and morphologically rich languages, since the model sees the raw character structure.
- Small embedding and output layers.

Disadvantages:

- Sequences become much longer, which increases compute and makes long-range dependencies harder to learn.
- No built-in notion of words or meaning; semantic structure must be learned from scratch.
- Typically needs more training data to reach comparable quality.

Use Cases:

- Noisy or user-generated text with many misspellings.
- Morphologically rich languages and open-vocabulary generation.
- Tasks such as spelling correction or character-level text generation.

2. N-Gram/Word-Based Language Models

Word-based models use words or fixed-length sequences of words (n-grams) as their input units.

Advantages:

- Words carry meaning directly, so semantic relationships are easier to capture, especially with pre-trained word embeddings.
- Much shorter input sequences than character-level models.
- More data-efficient when training data is limited.

Disadvantages:

- Large vocabularies, which inflate embedding and output layers.
- Poor handling of out-of-vocabulary words unless subword techniques are added.
- Limited ability to generalize to unseen words, misspellings, or rare morphological variants.

Use Cases:

- Classical NLP pipelines and n-gram models (e.g., autocomplete or simple statistical language modeling).
- Tasks that benefit from pre-trained word embeddings on moderate amounts of data.
- Domains with a relatively closed, well-behaved vocabulary.
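
For comparison, here is a minimal word-level sketch with an explicit `<UNK>` token and bigram counts. Again, the corpus and names are illustrative, not project code:

```python
from collections import Counter

# Toy word-level vocabulary with an explicit <UNK> token for OOV words.
corpus = "the model reads the text and the model predicts the next word".split()

vocab = {"<UNK>": 0}
for w in corpus:
    vocab.setdefault(w, len(vocab))

def encode_words(text: str) -> list[int]:
    # Words outside the vocabulary collapse to <UNK>, illustrating the OOV problem.
    return [vocab.get(w, vocab["<UNK>"]) for w in text.split()]

# Simple bigram (n=2) counts, the core statistic of an n-gram language model.
bigrams = Counter(zip(corpus, corpus[1:]))

print(encode_words("the model reads unseen tokens"))  # "unseen"/"tokens" -> 0 (<UNK>)
print(bigrams.most_common(2))                         # e.g. [(('the', 'model'), 2), ...]
```

The `<UNK>` fallback makes the OOV weakness visible: any word outside the training vocabulary loses its identity entirely.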


Which is Better?

| Criterion | Character-Based | N-Gram/Word-Based |
| --- | --- | --- |
| Handling OOV words | Excellent | Poor (unless using subword techniques) |
| Model size and efficiency | Smaller vocabulary but longer sequences | Larger vocabulary but shorter sequences |
| Semantic understanding | Weak, needs to learn from scratch | Stronger, especially with pre-trained embeddings |
| Data requirements | Needs more data to learn character-level patterns | More efficient with limited data |
| Generalization | Can generate unseen words | Limited unless using subword or subphrase methods |

Hybrid Approach: Subword Models

Modern language models (e.g., BERT, GPT) combine the strengths of both by using subword tokenization (e.g., byte-pair encoding or WordPiece):

- Frequent words stay intact as single tokens, preserving word-level semantics.
- Rare or unseen words are decomposed into known subword units, so there is effectively no OOV problem.
- The vocabulary stays at a manageable, fixed size.

Subword Tokenization Examples:

- BPE might split "unhappiness" into "un", "happi", "ness" (the exact split depends on the learned merges).
- WordPiece (used by BERT) marks continuations with "##", e.g., "playing" → "play", "##ing".
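
As a rough illustration of how BPE builds its subword vocabulary, here is a toy merge loop in the spirit of the original BPE algorithm. The word frequencies and the number of merges are made up for the example; real tokenizers are trained on large corpora:

```python
from collections import Counter

# Toy byte-pair-encoding (BPE) sketch: repeatedly merge the most frequent
# adjacent symbol pair. Word frequencies and merge count are illustrative.
vocab = Counter({"l o w </w>": 5, "l o w e r </w>": 2,
                 "n e w e s t </w>": 6, "w i d e s t </w>": 3})

def pair_counts(vocab):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the chosen pair with one merged symbol.
    merged = Counter()
    for word, freq in vocab.items():
        symbols, out, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[" ".join(out)] += freq
    return merged

for step in range(6):
    best = pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(best, vocab)
    print(step, best)   # e.g. ('e', 's'), ('es', 't'), ('est', '</w>'), ...
```

After a few merges, frequent sequences such as "es", "est", and "low" become single symbols, while rarer words remain split into smaller pieces, which is exactly the middle ground summarised in the table above.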


Conclusion

Neither approach is universally better. Character-based models are preferable for noisy, open-vocabulary, or morphologically rich input, while word/n-gram models are stronger when semantics matter and data is limited. A subword approach (BPE or WordPiece) offers a practical middle ground for this project, since it combines open-vocabulary coverage with word-like semantic units.