Open kemalbastak opened 4 months ago
Hi @kemalbastak
As far as I know, the CCNet pipeline does not support Turkish out of the box, but you can probably modify the pipeline to get it to support tr. We never went through that process, but to get there, I think you have to do the following steps:
I'd also recommend contacting the CCNet maintainers if you run into issues with that.
I hope this helps!
I calculated the percentiles for the 2023-50 CC dump, using the 'perplexity' score on that data:
percentiles = {f"%{i}": np.percentile(all_pp_values, i) for i in range(1, 101)}
The resulting values are close to those of the existing languages in cutoff.csv.
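For anyone who wants to reproduce this, here is a minimal, self-contained sketch of the percentile step. It assumes the perplexity scores for the new language have already been collected into a flat list or array. The function name `perplexity_percentiles` and the synthetic `sample_pp` data are made up for illustration, and the sketch does not reproduce the layout of cutoff.csv.

```python
import numpy as np


def perplexity_percentiles(pp_values):
    """Return the 1st..100th percentiles of a collection of perplexity scores.

    Hypothetical helper, not part of the CCNet or RedPajama code.
    """
    values = np.asarray(pp_values, dtype=float)
    if values.size == 0:
        raise ValueError("need at least one perplexity value")
    points = np.arange(1, 101)
    # One vectorized call instead of 100 separate np.percentile calls.
    cutoffs = np.percentile(values, points)
    return {int(p): float(c) for p, c in zip(points, cutoffs)}


# Synthetic stand-in for real perplexity scores (illustration only).
rng = np.random.default_rng(0)
sample_pp = rng.lognormal(mean=6.0, sigma=0.5, size=10_000)

tr_percentiles = perplexity_percentiles(sample_pp)
print(tr_percentiles[33], tr_percentiles[66])
```

Which percentiles to use as the head/middle/tail boundaries is a separate choice. Comparing against the values already present for other languages in cutoff.csv, as above, is a reasonable sanity check.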
Thanks for the answer
I am trying to add the Turkish language (tr) to cutoff.csv on the rp_v1 branch. There is little information on how the language score is calculated. How do we add a custom language score to this CSV?