livehelpnow / tds_ecto

TDS Adapter for Ecto
57 stars 34 forks source link

String with accents are converted to binary #63

Closed nicklayb closed 6 years ago

nicklayb commented 6 years ago

My SQL Server database uses a lot of accent (é, è, ï, ê) since major part of our customer base are french. When I'm querying the database for a customer with an accent in his name, the name field is now converted as a binary, for instance:

iex(7)> MZ.Repo.get(MZ.Customer, 6100)
[debug] QUERY OK source="Customers" db=17.9ms queue=0.1ms
SELECT C0.[id], C0.[customerNumber], C0.[name], C0.[languageId] FROM [Customers] AS C0 WHERE (C0.[id] = @1) [6100]
%MZ.Customer{
  __meta__: #Ecto.Schema.Metadata<:loaded, "Customers">,
  customer_number: "12345",
  id: 6100,
  language: #Ecto.Association.NotLoaded<association :language is not loaded>,
  languageId: nil,
  name: "Paul Gaboury",
  users: #Ecto.Association.NotLoaded<association :users is not loaded>
}

But with accent

iex(9)> MZ.Repo.get(MZ.Customer, 6562)
[debug] QUERY OK source="Customers" db=23.2ms
SELECT C0.[id], C0.[customerNumber], C0.[name], C0.[languageId] FROM [Customers] AS C0 WHERE (C0.[id] = @1) [6562]
%MZ.Customer{
  __meta__: #Ecto.Schema.Metadata<:loaded, "Customers">,
  customer_number: "54321",
  id: 6562,
  language: #Ecto.Association.NotLoaded<association :language is not loaded>,
  languageId: 1,
  name: <<201, 99, 111, 108, 101, 32, 76, 101, 32, 112, 101, 116,
    105, 116, 32, 80, 97, 117, 108>>,             # <-- Should be "École le petit paul"
  users: #Ecto.Association.NotLoaded<association :users is not loaded>
}

Is there any setting I can add to make sure it handle correctly all the accents encoding? I have tried the following hook as you suggested but did nothing :

  def after_connect(conn) do
    Tds.Ecto.Connection.query(conn, "SET LANGUAGE French", [], [])
  end

Thank you!

Edit I also tried changing my :string field to a Tsd.VarChar but does the same

mjaric commented 6 years ago

Could you please post here or send me email with table definition, ecto model schema and migration file. I need to know what actual db and elixir types are involved in the issue

mjaric commented 6 years ago

One more thing, could you please tell me what is db collation?

nicklayb commented 6 years ago

Of course, here they are:

Table:

CREATE TABLE Beta8.dbo.Customers (
    ID int NOT NULL IDENTITY(1,1),
    CustomerNumber varchar(50),
    Name varchar(75),
    Address varchar(250),
    CountryID int,
    StateID int,
    CityID int,
    ZipCodeID int,
    Office varchar(10),
    IsIndividual bit,
    IsMainCustomer bit,
    Blacklist bit,
    Prospect bit,
    MainCustomerID int,
    PhoneNumber varchar(25),
    PhoneNumberExtention varchar(8),
    Fax varchar(25),
    Cell varchar(25),
    PhoneOther varchar(25),
    Email varchar(100),
    WebSite varchar(100),
    CustomerStatusID int,
    RecoveryRateSingle numeric(22,2),
    CharterRecoveryRateID int,
    InterestRate numeric(22,2),
    GracePeriod int,
    OpeningDate datetime,
    SubscriptionEndDate datetime,
    LanguageID int,
    CurrencyID int,
    RoundTripKm numeric(22,2),
    ContactMethod int,
    AccountDirectorID int,
    SellerID int,
    CustomerServiceManagerID int,
    AdministrativeContact varchar(100),
    Revenue money,
    NbEmpl int,
    PaymentsTermsID int,
    LocationBillingID int,
    NameOnCheck varchar(100),
    ReferenceNumber varchar(100),
    Payback int,
    BillingFrequency int,
    ActionID int,
    AssignedNewFileID int,
    DisplayDebtorsInOrder int,
    ChargeCommissionToDebtor bit,
    TransferTheCredit bit,
    DateCreditFile datetime,
    DirectDeposit bit,
    BankNumber varchar(50),
    TransitNumber varchar(50),
    FolioNumber varchar(50),
    WebUsername varchar(100),
    WebPassword text,
    PreauthorizedPaymentPercentage int,
    Statistic int,
    ReferenceID int,
    CentralizedFiles bit,
    FileStateID int,
    FileCategoryID int,
    ProfileID int,
    FileStatusID int,
    IsAgency bit,
    SendInvoiceByEmail bit,
    DebtorDefaultWrittenLanguage int,
    Note text,
    Inactive bit,
    DepartmentCustomerID int,
    RecallDate datetime,
    RecallComments text,
    OpenFeeNote varchar(20),
    SpecificPhoneNumber varchar(20),
    RecallResourceID int,
    DontTaxeEquipment bit,
    SIFNotAllowed bit,
    SendAutomatedEmails bit,
    AddEmptyPageToExportLetter bit,
    SendAutomatedEmailsNewFile bit,
    SendAutomatedEmailsCloseFile bit,
    SendAutomatedEmailsChangeStatusFile bit,
    SendAutomatedEmailsNewPayment bit,
    SendAutomatedEmailsPaymentAgreementCreated bit,
    WebInactive bit,
    [_IT3ApiToken] varchar(512),
    IsCommeleo bit,
    MinCapitalToTransfert money,
    Exclusive bit,
    SuccessRate numeric(22,2),
    LastFileOpeningDate datetime,
    WrittenLanguage int,
    CONSTRAINT PK_Customers PRIMARY KEY (ID),
    CONSTRAINT FK_Actions_Customers_ActionID FOREIGN KEY (ActionID) REFERENCES Beta8.dbo.Actions(ID) ON DELETE RESTRICT ON UPDATE RESTRICT,
    CONSTRAINT FK_CharterRecoveryRates_Customers_CharterRecoveryRateID FOREIGN KEY (CharterRecoveryRateID) REFERENCES Beta8.dbo.CharterRecoveryRates(ID) ON DELETE RESTRICT ON UPDATE RESTRICT,
    CONSTRAINT FK_CountryStateCity_Customers_CityID FOREIGN KEY (CityID) REFERENCES Beta8.dbo.CountryStateCity(ID) ON DELETE RESTRICT ON UPDATE RESTRICT,
    CONSTRAINT FK_CountryState_Customers_StateID FOREIGN KEY (StateID) REFERENCES Beta8.dbo.CountryState(ID) ON DELETE RESTRICT ON UPDATE RESTRICT,
    CONSTRAINT FK_Country_Customers_CountryID FOREIGN KEY (CountryID) REFERENCES Beta8.dbo.Country(ID) ON DELETE RESTRICT ON UPDATE RESTRICT,
    CONSTRAINT FK_Currency_Customers_CurrencyID FOREIGN KEY (CurrencyID) REFERENCES Beta8.dbo.Currency(ID) ON DELETE RESTRICT ON UPDATE RESTRICT,
    CONSTRAINT FK_CustomerStatus_Customers_CustomerStatusID FOREIGN KEY (CustomerStatusID) REFERENCES Beta8.dbo.CustomerStatus(ID) ON DELETE RESTRICT ON UPDATE RESTRICT,
    CONSTRAINT FK_Customers_Customers_MainCustomerID FOREIGN KEY (MainCustomerID) REFERENCES Beta8.dbo.Customers(ID) ON DELETE RESTRICT ON UPDATE RESTRICT,
    CONSTRAINT FK_DebtorsProfiles_Customers_ProfileID FOREIGN KEY (ProfileID) REFERENCES Beta8.dbo.DebtorsProfiles(ID) ON DELETE RESTRICT ON UPDATE RESTRICT,
    CONSTRAINT FK_DepartmentsCustomers_Customers_DepartmentCustomerID FOREIGN KEY (DepartmentCustomerID) REFERENCES Beta8.dbo.DepartmentsCustomers(ID) ON DELETE RESTRICT ON UPDATE RESTRICT,
    CONSTRAINT FK_FilesCategories_Customers_FileCategoryID FOREIGN KEY (FileCategoryID) REFERENCES Beta8.dbo.FilesCategories(ID) ON DELETE RESTRICT ON UPDATE RESTRICT,
    CONSTRAINT FK_FilesStates_Customers_FileStateID FOREIGN KEY (FileStateID) REFERENCES Beta8.dbo.FilesStates(ID) ON DELETE RESTRICT ON UPDATE RESTRICT,
    CONSTRAINT FK_FilesStatus_Customers_FileStatusID FOREIGN KEY (FileStatusID) REFERENCES Beta8.dbo.FilesStatus(ID) ON DELETE RESTRICT ON UPDATE RESTRICT,
    CONSTRAINT FK_Languages_Customers_LanguageID FOREIGN KEY (LanguageID) REFERENCES Beta8.dbo.Languages(ID) ON DELETE RESTRICT ON UPDATE RESTRICT,
    CONSTRAINT FK_LocationsBilling_Customers_LocationBillingID FOREIGN KEY (LocationBillingID) REFERENCES Beta8.dbo.LocationsBilling(ID) ON DELETE RESTRICT ON UPDATE RESTRICT,
    CONSTRAINT FK_PaymentsTerms_Customers_PaymentsTermsID FOREIGN KEY (PaymentsTermsID) REFERENCES Beta8.dbo.PaymentsTerms(ID) ON DELETE RESTRICT ON UPDATE RESTRICT,
    CONSTRAINT FK_Reference_Customers_ReferenceID FOREIGN KEY (ReferenceID) REFERENCES Beta8.dbo.Reference(ID) ON DELETE RESTRICT ON UPDATE RESTRICT,
    CONSTRAINT FK_Resources_Customers_AccountDirectorID FOREIGN KEY (AccountDirectorID) REFERENCES Beta8.dbo.Resources(ID) ON DELETE RESTRICT ON UPDATE RESTRICT,
    CONSTRAINT FK_Resources_Customers_AssignedNewFileID FOREIGN KEY (AssignedNewFileID) REFERENCES Beta8.dbo.Resources(ID) ON DELETE RESTRICT ON UPDATE RESTRICT,
    CONSTRAINT FK_Resources_Customers_CustomerServiceManagerID FOREIGN KEY (CustomerServiceManagerID) REFERENCES Beta8.dbo.Resources(ID) ON DELETE RESTRICT ON UPDATE RESTRICT,
    CONSTRAINT FK_Resources_Customers_RecallResourceID FOREIGN KEY (RecallResourceID) REFERENCES Beta8.dbo.Resources(ID) ON DELETE RESTRICT ON UPDATE RESTRICT,
    CONSTRAINT FK_Resources_Customers_SellerID FOREIGN KEY (SellerID) REFERENCES Beta8.dbo.Resources(ID) ON DELETE RESTRICT ON UPDATE RESTRICT,
    CONSTRAINT FK_ZipCodes_Customers_ZipCodeID FOREIGN KEY (ZipCodeID) REFERENCES Beta8.dbo.ZipCodes(ID) ON DELETE RESTRICT ON UPDATE RESTRICT
) go
CREATE INDEX IX_LocationBilling ON Beta8.dbo.Customers (LocationBillingID) go
CREATE INDEX IX_Name ON Beta8.dbo.Customers (Name) go;

Schema:

defmodule MZ.Customer do
  use Ecto.Schema

  schema "Customers" do
    field :customer_number, :string, source: :customerNumber
    field :name, Tds.VarChar

    belongs_to :language, MZ.Language, foreign_key: :languageId
    has_many :users, MZ.User, foreign_key: :exId
    has_many :files, MZ.File, foreign_key: :customerId
  end
end

The collation is : SQL_Latin1_General_CP1_CI_AS

I don't have migrations since I connect to an existing database from our main software which is not our creation.

mjaric commented 6 years ago

the issue is that collation and column type for :name column are not correct. TDS will not threat VarChar columns as ucs2 strings regardless bytes are "there" since that is same thing MSSQL server does. We had a lot of issues earlier when that was supported ad we tried to threat VarChar as UCS2 string. Your column name, should be of NVarChar type since NOT ALL French letters are 1 byte long. NVarChar is ucs2 string, meaning it will use 2 bytes to store single character. I know this sounds wired, but MS has a lot wired stuff under the hood, it will literally take all bytes you send them and store it to VarChar column without complaining, it will not try to check if codepoints are valid. Of course, this would have huge performance implications to server itself if they have done it. What is even more odd here is that MS will give you back bytes as is :).

There is nice comparison between VarChar and NVarChar types here that illustrates this problem.

Finally, I can't give you nice solution for this, maybe in future we will find way to handle somehow decoding, but there will be always "how other platform" will store VarChar column in database where elixir (second platform) will read such column. And even if we do it, it will become endless bug fixing. What I could suggest is that you try decode value with Tds.Utils.ucs2_to_utf/1 or use fragment in query from m in Model, .... select: %{name: fragment("cast(? as nvarchar) as name", m.name)} ..... OR create your custom ecto type CustomVarchar and use Tds.Utils.ucs2_to_utf/1 in load function to convert it to utf8 string when String.valid?(value) == false. Then use your custom type in schema instead of Tds.Varchar. You can copy all code from Tds.Varchar module and just amend load function

nicklayb commented 6 years ago

You the real MVP!!

Creating my own types as worked perfectly! Here is what I wrote if anyone requires it:

defmodule MZ.EctoTypes.VarChar do
  @moduledoc """
  Converts TDS Nvarchar to an understandable field for Ecto
  """

  @behaviour Ecto.Type

  @doc "Returns :varchar atom, so we can use :varchar type in migration files"
  def type(), do: :varchar

  def cast(value) do
    {:ok, convert_ucs2 value}
  end

  def load(value) do
    {:ok, convert_ucs2 value}
  end

  def dump(value) when is_binary(value) do
    {:ok, {convert_ucs2(value), :varchar}}
  end

  defp convert_ucs2(value) when is_binary(value) do
    Tds.Utils.ucs2_to_utf(value)
  end

  defp convert_ucs2(value) do value end
end
ksherman commented 4 years ago

@mjaric Sorry for pinging on an old issue! I had this same challenge in my app with varchar fields. We're also using the same collation SQL_Latin1_General_CP1_CI_AS

I'm curious if this is "fixed" or if there's a different solution for the draft PR https://github.com/elixir-ecto/ecto_sql/pull/184?

I updated my decencies to point to that repo and Tds.Types.VarChar still resulted in the previous

** (Jason.EncodeError) invalid byte 0xC3 in <<80, 195, 63, 194, 182, 100, 195, 63, 194, 182, 114, 32, 65, 112, 114, 105, 99, 111, 116, 32, 115, 101, 101, 100, 32, 111, 105, 108, 32, 79, 114, 103, 97, 110, 105, 99>>

Also, is the "right answer" that this column should have been an nvarchar type to support accents etc?

mjaric commented 4 years ago

The best thing/solution you can do is to change that column into nvarchar.

Any workaround you try always has an issue. for instance if column is varchar with Latin1 collation and you try to put non latin1 codepoint e,g,

  1. parameter is nvarchar, then sql server will replace invalid codepoints with "?"
  2. parameter is varchar encoded already with latin1 (depending what encoder you used in your app, it will either contain "?" for invalid codepoints or they will be removed from binary)
  3. You can guess encoding in your app, and for the sake of simplicity lets say that your app guessed correctly, app will have correct binary:
    • if you try to pass that value as varchar parameter, you will get garbage in column and when you try to read, result is not pretty
    • if you try to pass that value as varbinary it will keep codepoints as is and when you read that column it will probably be Ok for you app, but if you checkthat record in mssql using sql management studio you will see garbage, meaning that probably you can not compare that column in where clause. There is high probability that using LIKE will not yield result you want.

Again, your best friend is nvarchar :)

mjaric commented 4 years ago

BTW, comparing time required to alter table column from varchar to nvarchar is most likely much shorter than the time you will spend trying to solve the issue leaving column to varchar, and you will have allways some edge case that doesn't work. NVarchar is meant to be used for multilingual support.

ksherman commented 4 years ago

Totally, makes sense and all that reflects what I've run into saving those characters back into the database. Appreciate the confirmation!

Also, amazing work on the new version, I put it into a test server last night and so far has been a drop-in replacement!

mjaric commented 4 years ago

thanks for the feedback!!! I really appreciate.