Most modern organisations use the web and social media platforms for employee recruitment and job advertisements. In recent years, many companies have preferred to post their vacancies online so that interested job seekers can access them easily and quickly. The number of job seekers who search online for employment opportunities has consequently grown rapidly. Unfortunately, this trend also gives criminals an opportunity to exploit job seekers with fake job offers as a way to extract personal information for nefarious activities. A system for detecting fraudulent job postings is therefore necessary both for the credibility of job posting sites and companies and for the safety of job seekers. By applying machine learning techniques, we can help job seekers detect fraudulent job ads and deploy a job recommendation system.
For this project, a raw dataset containing fraudulent and real jobs from about 17,880 job posting observations was used. The [dataset](https://www.kaggle.com/subhajournal/job-fraud-detection) was obtained from [Kaggle](https://www.kaggle.com). This dataset covers the different types of jobs, their location, salary range, job descriptions and roles.
The dataset consisted of 17,880 observations and 16 features. The data was a combination of integer and string data types. Some features contained a large number of null values, while others had few or none. A 60% null-value threshold was used to identify the columns to drop from the dataset.
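Dropping columns above such a null threshold is a one-liner in pandas; the sketch below illustrates the idea on a toy frame (the helper name and toy columns are ours, not the project's code):

```python
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.6) -> pd.DataFrame:
    """Return a copy of df without columns whose null fraction exceeds threshold."""
    null_fraction = df.isnull().mean()                  # per-column fraction of nulls
    keep = null_fraction[null_fraction <= threshold].index
    return df[keep].copy()

# Toy example (not the real dataset):
toy = pd.DataFrame({
    "title": ["Engineer", "Analyst", "Clerk", "Manager"],
    "salary_range": [None, None, None, "40k-60k"],      # 75% null -> dropped
    "fraudulent": [0, 1, 0, 0],
})
cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))   # ['title', 'fraudulent']
```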
In order to better understand the data and get information from the dataset used, some exploratory data analysis was carried out.
As seen below, the location with the highest count in the dataset is GB, London, followed by US, New York. The other eight top locations can be observed in the chart below.
A countplot was also obtained from the Experience and Type_of_Employment columns, as shown below. The most common category in the Experience column was "Not stated", followed by "Mid-Senior level". It was also observed that full-time roles were the most common. Job advertisements with no stated experience were more likely to be fraudulent than those in any other category.
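The counts behind such a plot, and the per-category fraud rate behind the "more likely to be fraudulent" observation, can be computed directly with pandas. The frame and column names below are toy stand-ins inferred from the text, not the project's actual data:

```python
import pandas as pd

# Toy stand-in for the job postings dataframe (values are illustrative only).
jobs = pd.DataFrame({
    "Experience": ["Not stated", "Mid-Senior level", "Not stated",
                   "Entry level", "Not stated"],
    "Type_of_Employment": ["Full-time", "Full-time", "Part-time",
                           "Full-time", "Contract"],
    "Fraudulent": [1, 0, 1, 0, 0],
})

# Category counts, as a countplot would display them.
print(jobs["Experience"].value_counts())

# Fraud rate per experience category.
fraud_rate = jobs.groupby("Experience")["Fraudulent"].mean()
print(fraud_rate.sort_values(ascending=False))
```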
The plot above shows that titles like "Customer Service Representative" and "Administrative Assistant" contained the largest numbers of fake job advertisements.
We can see a similar pattern in the Qualification column: most of the fake job ads did not state the required qualifications.
It was found that 85% of the jobs in the dataset either had no salary range provided or were unpaid. Specifically, 86% of the real jobs and 75% of the fake jobs had no salary range.
The histogram revealed similar salary distributions for real and fake job postings. The boxplot is slightly more revealing: the real jobs had a greater spread than the fake ones, but the medians were about the same (although it is difficult to believe that the mean salary was around $10,000).
The preprocessing stage for our data was very straightforward since the majority of our columns contained text. The steps included the following:
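A minimal sketch of the kind of cleaning such a text pipeline usually involves is shown below; the exact steps and the helper name are our assumptions, not the project's recorded code:

```python
import re
import string

def clean_text(text: str) -> str:
    """Lowercase, strip HTML tags, URLs, punctuation and extra whitespace."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)           # drop HTML tags
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # drop URLs
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip()       # collapse whitespace
    return text

print(clean_text("<p>Apply NOW at http://example.com!!  Great pay.</p>"))
# apply now at great pay
```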
We then generated a correlation matrix to see the importance of the numerical features. The matrix is shown below:
The matrix indicated that the numerical features had very low correlation with the target Fraudulent feature. Therefore, we dropped the numerical features and focused on the text feature we had cleaned.
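Such a correlation check can be reproduced with `DataFrame.corr()`. The sketch below uses toy values; the column names are plausible numeric features from this kind of job-posting data, assumed for illustration:

```python
import pandas as pd

# Toy numerical features alongside the target (illustrative values only).
df = pd.DataFrame({
    "telecommuting": [0, 1, 0, 0, 1, 0],
    "has_company_logo": [1, 1, 0, 1, 0, 1],
    "fraudulent": [0, 0, 1, 0, 1, 0],
})

corr = df.corr()
# Correlation of each numeric column with the target column.
print(corr["fraudulent"])
```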
A total of nine models were trained on the vectorized data, and the three best-performing models were fine-tuned using grid search before the final model was selected. A Passive Aggressive classifier showed the best performance on the test set and was therefore selected.
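The selected setup can be sketched with scikit-learn as a TF-IDF pipeline tuned by grid search. The corpus, labels and parameter grid below are toy assumptions; only the choice of `PassiveAggressiveClassifier` and the use of grid search come from the text:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus standing in for cleaned job-posting text (1 = fraudulent).
texts = [
    "earn money fast work from home no experience needed",
    "send your bank details to start immediately",
    "software engineer role with competitive salary and benefits",
    "customer service representative for established retail company",
    "quick cash guaranteed just pay a small registration fee",
    "data analyst position requiring sql and python experience",
]
labels = [1, 1, 0, 0, 1, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", PassiveAggressiveClassifier(max_iter=1000, random_state=42)),
])

# Small illustrative grid, sketching the fine-tuning step.
grid = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.1, 1.0]},
    cv=2,
    scoring="f1",
)
grid.fit(texts, labels)
print(grid.best_params_)
```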
The model was evaluated primarily using three metrics: precision, recall and F1 score. The F1 score was treated as the most important metric since it combines precision and recall, and the results we obtained can be seen in the image below.
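As a reminder of how these metrics relate, they can be computed directly from confusion-matrix counts (hand-rolled here with made-up counts, not the project's actual results; `sklearn.metrics` provides the same in practice):

```python
# Illustrative counts only, not the project's actual results.
tp, fp, fn = 40, 10, 20   # true positives, false positives, false negatives

precision = tp / (tp + fp)   # of flagged ads, how many were truly fake
recall = tp / (tp + fn)      # of truly fake ads, how many were flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.3f} f1={f1:.3f}")
```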
The model was deployed using Streamlit after the vectorizer and the trained model were pickled. The pickled files were used in the deployment to classify text inputs containing details of the job being advertised.
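The pickling side of such a deployment can be sketched as follows. The file names, toy training data and Streamlit wiring are all assumptions; the Streamlit part is shown as comments so the sketch stays self-contained:

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Fit a toy vectorizer and model, standing in for the real trained artifacts.
texts = ["earn quick money now", "software engineer position",
         "guaranteed cash no experience", "data analyst role"]
labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer()
model = PassiveAggressiveClassifier(max_iter=1000, random_state=42)
model.fit(vectorizer.fit_transform(texts), labels)

# Pickle both artifacts for the deployed app to load.
with open("vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# In the Streamlit app (e.g. app.py), the pickles would be loaded roughly as:
#   import streamlit as st
#   vec = pickle.load(open("vectorizer.pkl", "rb"))
#   clf = pickle.load(open("model.pkl", "rb"))
#   ad_text = st.text_area("Paste the job advertisement")
#   if st.button("Check") and ad_text:
#       pred = clf.predict(vec.transform([ad_text]))[0]
#       st.write("Fraudulent" if pred == 1 else "Looks legitimate")

# Round-trip check: the reloaded artifacts classify text the same way.
with open("vectorizer.pkl", "rb") as f:
    vec2 = pickle.load(f)
with open("model.pkl", "rb") as f:
    model2 = pickle.load(f)
print(model2.predict(vec2.transform(texts)))
```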