The Expense Classifier is a full-stack, serverless application that combines computer vision and language modeling to automate the classification of business expenses from receipts. It uses Google's Document AI to extract and parse structured data from receipt images, then uses an OpenAI language model to classify each expense into a predefined business category. Finally, users can download the classified expenses as a CSV file.
Businesses and organizations frequently process large volumes of receipts and invoices, which often means manual, time-consuming work sorting and categorizing expenses. This manual handling leads to errors, inefficiencies, and limited insight into business expenditure.
Our application solves this problem by automating the process of extracting data from receipts, classifying the expenses into categories such as "Travel" or "Meals," and providing an easy-to-download CSV file of the classified expenses.
The system architecture consists of:
- A serverless application front end where users upload receipt images
- Google Document AI, which extracts and parses structured data from each receipt
- An OpenAI language model, which classifies each extracted expense into a predefined business category such as "Travel" or "Meals"
- A CSV export that lets users download the classified expenses
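The sketch below illustrates this flow end to end, assuming the Document AI and OpenAI Python SDKs. The project ID, location, processor ID, model name, and category list shown here are placeholders for illustration, not this project's actual configuration.

```python
# Hypothetical end-to-end sketch: Document AI extraction -> OpenAI classification -> CSV export.
import csv
from google.cloud import documentai
from openai import OpenAI

PROJECT_ID = "my-project"    # placeholder: your GCP project
LOCATION = "us"              # placeholder: processor region
PROCESSOR_ID = "abc123"      # placeholder: a receipt parser processor

CATEGORIES = ["Travel", "Meals", "Office Supplies", "Other"]  # example categories


def parse_receipt(image_bytes: bytes) -> str:
    """Send a receipt image to a Document AI processor and return the extracted text."""
    client = documentai.DocumentProcessorServiceClient()
    name = client.processor_path(PROJECT_ID, LOCATION, PROCESSOR_ID)
    raw_document = documentai.RawDocument(content=image_bytes, mime_type="image/jpeg")
    result = client.process_document(
        request=documentai.ProcessRequest(name=name, raw_document=raw_document)
    )
    return result.document.text


def classify_expense(receipt_text: str) -> str:
    """Ask an OpenAI chat model to pick one category for the extracted receipt text."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder: any chat-capable model
        messages=[
            {
                "role": "system",
                "content": f"Classify the receipt into one of: {', '.join(CATEGORIES)}. "
                           "Reply with the category name only.",
            },
            {"role": "user", "content": receipt_text},
        ],
    )
    return response.choices[0].message.content.strip()


def export_csv(rows: list[dict], path: str = "classified_expenses.csv") -> None:
    """Write the classified expenses to a CSV file for download."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["receipt_text", "category"])
        writer.writeheader()
        writer.writerows(rows)
```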
When you initiate a supervised fine-tuning job, the model learns additional parameters that help it capture the information needed to perform the intended task or adopt the desired behavior. These parameters are applied during inference, effectively producing a new model that combines the newly learned knowledge with the capabilities of the original model.
Supervised fine-tuning is particularly effective for text models when the desired output is straightforward and easy to define. It is ideal for tasks such as classification, sentiment analysis, entity extraction, summarizing uncomplicated content, and formulating domain-specific queries. For code models, supervised fine-tuning is the only available approach.
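For illustration only, a supervised fine-tuning job can be launched roughly as follows using the OpenAI fine-tuning API; the training file name and base model are assumptions for this sketch, not this project's actual setup.

```python
from openai import OpenAI

client = OpenAI()

# Upload a JSONL file of labeled examples (hypothetical file name).
training_file = client.files.create(
    file=open("expense_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# Start the supervised fine-tuning job; the resulting tuned model layers the
# learned behavior on top of the base model and is used at inference time.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)
```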
F1 score (0.813): This indicates a strong balance between precision and recall on the labeling task. An F1 score above 0.8 is generally considered good, suggesting the model identifies labels reliably without sacrificing accuracy.
F1 score: the harmonic mean of precision and recall, combining the two into a single metric with equal weight given to each. Defined as 2 × (Precision × Recall) / (Precision + Recall)
Precision (85.5%): When the model assigns a label, it is correct 85.5% of the time. High precision matters in labeling tasks, where false positives cause misclassifications that propagate to downstream applications.
Precision: the proportion of predictions that match the annotations in the test set. Defined as True Positives / (True Positives + False Positives)
Recall (77.6%): The model correctly identifies 77.6% of all relevant instances (true positives). This is decent, but there is room for improvement, especially if missed labels (false negatives) would significantly affect our use case.
Recall: the proportion of annotations in the test set that are correctly predicted. Defined as True Positives / (True Positives + False Negatives)
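As a quick sanity check, the reported F1 score can be recomputed from the reported precision and recall:

```python
# Recompute F1 from the reported precision and recall (percentages above,
# expressed here as fractions).
precision = 0.855
recall = 0.776

f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 3))  # 0.814 -- matches the reported 0.813 up to rounding of the inputs
```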
Test Documents: The model was tested on 39 documents, providing a small but manageable dataset for evaluation.
Evaluated Documents: All 39 documents were evaluated, showing that there were no issues with the dataset.
Invalid Documents: With 0 invalid documents, this means all our documents were formatted correctly and usable for the model.
Failed Documents: 0 failed documents indicates a smooth evaluation process without any errors.