girke-lab / ChemmineR

Cheminformatics Toolkit for R

Converting multiple *.mol files into a single *.sdf. #8

Closed QizhiSu closed 1 year ago

QizhiSu commented 3 years ago

Dear ChemmineR developer,

I am trying to combine multiple .mol files into a single .sdf file in R. Since the .mol files contain no SMILES, InChI, or InChIKey information, I would also like to add this information to the final .sdf file. I know this can be done easily on the command line with Open Babel using the following command: obabel *.MOL -O result.sdf --add cansmi InChI InChIKey

How can I achieve this in R with ChemmineR?

In addition, when I tried to get the InChI and InChIKey from an SDF file using the propOB() function, I got the following error: *** Open Babel Error in GetStringvalue InChIFormat is not loaded

Any clue about this issue?

Thanks in advance.

Best, Sukis

khoran commented 3 years ago

You should be able to read in the mol files using the function read.SDFset, since SDF is a superset of MOL. Then you can concatenate your mol files into one SDFset using the c() function. Then you can update the datablock with the cansmi and InChI data with

datablock(my_mol_files) = propOB(my_mol_files)

Then you can write them out with write.SDF(my_mol_files). This will write in SDF format though.
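
Put together, the steps above might look like the following minimal sketch. The file names a.mol, b.mol and result.sdf are hypothetical placeholders, and the sketch assumes that ChemmineOB is installed (propOB() depends on it) and that each .mol file holds exactly one structure.

```r
library(ChemmineR)  # propOB() additionally requires the ChemmineOB package

# Hypothetical input files; replace with your own paths
a <- read.SDFset("a.mol")
b <- read.SDFset("b.mol")

# Concatenate the two single-compound SDFsets into one SDFset
my_mol_files <- c(a, b)

# Give the compounds distinct IDs (each file read on its own gets a default ID)
cid(my_mol_files) <- c("a", "b")

# Compute Open Babel properties (cansmi, InChI, ...) and store them
# in the data block of each compound
datablock(my_mol_files) <- propOB(my_mol_files)

# Write everything to one SDF file; cid = TRUE uses the compound IDs as names
write.SDF(my_mol_files, file = "result.sdf", cid = TRUE)
```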

Are you running on Windows? I believe the InChI Open Babel plugin doesn't work correctly on Windows, unfortunately. Try it again on Linux or Mac if possible.

QizhiSu commented 3 years ago

Understood. Thank you very much. I am working on Windows; I will give it a try on Linux.

best, Sukis

tgirke commented 3 years ago

I am not sure if we have a function in ChemmineR for this. When working with many structures, using the SDF (or SMILES) format is usually recommended. What you could do in ChemmineR is store the file paths to each mol file in a character vector and then import each into an SDFset object iteratively in a loop. The standard R function list.files can be used to create the file paths with a regular expression or wildcard in a single step like so:

mymolfiles <- list.files(path="...", pattern="...") # -> now loop to import into SDFset
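
The loop hinted at in the comment could be sketched as follows. The directory path, the file pattern and the output file name are hypothetical, and the sketch assumes every .mol file contains exactly one readable structure. Reduce(c, ...) is used because lapply() returns a plain list of SDFsets rather than a single SDFset.

```r
library(ChemmineR)

# Hypothetical directory; the regular expression matches names ending in .mol or .MOL
mol_paths <- list.files(path = "path/to/mol/files", pattern = "\\.mol$",
                        ignore.case = TRUE, full.names = TRUE)

# Import each file separately; the result is an ordinary list of SDFset objects
sdf_list <- lapply(mol_paths, read.SDFset)

# Fold the list into a single SDFset by pairwise concatenation
# (c() may warn about duplicated default compound IDs at this point)
sdfset <- Reduce(c, sdf_list)

# Name the compounds after their source files; this assumes one structure per file
cid(sdfset) <- make.unique(tools::file_path_sans_ext(basename(mol_paths)))

# The combined object can now be written out as one SDF file
write.SDF(sdfset, file = "combined.sdf", cid = TRUE)
```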

T.G.

On Wed, May 19, 2021 at 2:22 AM Sukis123 wrote:

Dear Khoran, I can read a single .mol file at a time with read.SDFset() and then combine them using c(), but I failed to read all the mol files together. I was trying this: lapply(mol_sub, read.SDFset). The code can read all *.mol files at once, but the output is a list rather than an SDFset, so I cannot save it as an SDF file.

Any clue?

best regards, Sukis

-- Thomas Girke, Ph.D. Professor of Bioinformatics 1207F Genomics Building University of California Riverside, CA 92521

E-mail: @.*** URL: https://girke.bioinformatics.ucr.edu Phone/Cell/Text: 951-732-7072 Fax: 951-827-4437

QizhiSu commented 3 years ago

Dear tgirke,

I have tried as you suggested with the following code.

mol_files <- list.files(path = *****, pattern = '\\.MOL$', full.names = TRUE)

# Read each .mol file and compute its Open Babel properties
extract_meta <- function(files) {
  meta_data <- lapply(files, function(x) {
    sdf <- read.SDFset(x, skipErrors = TRUE)
    propOB(sdf)
  })
  # Stack the per-file property tables into one data frame
  do.call(rbind, meta_data)
}

data <- extract_meta(mol_files)

This code works in most cases, but for some compounds (I share an example mol file here: https://unizares-my.sharepoint.com/:u:/g/personal/773609_unizar_es/EdU0hjmsYm1Mj5V5ZeVm0nMBCH2k8_KuL2IsTFkxnJhfjQ?e=kOxMOC) R crashes. read.SDFset() is able to read this mol file, but when I run propOB() on the imported structure, R crashes.

I also tried it in Open Babel. obabel problem.mol -O problem.sdf --add inchi inchikey is OK, but obabel problem.mol -O problem.sdf --add cansmi has the same problem. So I think the problem lies in calculating the cansmi of this compound. When I run obabel *.mol -O output.sdf --add cansmi inchi inchikey in Open Babel, it works perfectly and some compounds are removed (I think those are the problematic ones).

So I am wondering whether there is a way to exclude problematic compounds in R as well, since my *.mol files contain more than one problematic compound and I don't know which ones they are. I see no option in propOB() like skipErrors in read.SDFset().

Best, Sukis

tgirke commented 3 years ago

You want to first import all your compounds into an SDFset and then write it out to a batch SDF file with write.SDF. After that, you import them with read.SDFset and then do the OB part. During the import it will tell you which compounds may be invalid or problematic, and you can then remove them as instructed.
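
A minimal sketch of that re-import and validity check is shown below, assuming a batch file named combined.sdf (a hypothetical name). validSDF() only checks the basic structure of each SDF entry, so it may not flag every compound that causes trouble later in Open Babel.

```r
library(ChemmineR)

# Hypothetical batch file written earlier with write.SDF()
sdfset <- read.SDFset("combined.sdf")

# Logical vector: TRUE for entries that pass ChemmineR's basic SDF checks
valid <- validSDF(sdfset)

# Positions of the entries that failed the check
which(!valid)

# Keep only the valid entries before running propOB()
sdfset <- sdfset[valid]
```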

T.G.

QizhiSu commented 3 years ago

The exported *.sdf file can be read perfectly by read.SDFset(), but unfortunately the problem remains with the newly imported SDFset.

cheers, Sukis

tgirke commented 3 years ago

Could you provide the SDF with all or most of the relevant compounds so that we can take a look? Also, which versions of R, ChemmineR and ChemmineOB are you using? The sessionInfo() output provides this.

QizhiSu commented 3 years ago

At the moment, I have only identified one compound, but there should be more because I still have the same problem after removing this compound. Here is a link to download the SDF: https://unizares-my.sharepoint.com/:u:/g/personal/773609_unizar_es/ETag3-6aWaVGgJNX6QrmBG4BXy2Sj0RmlPR9U2QHFLcRVQ?e=Ov0wSR. The mol file has already been shared above.

The system used is:

R version 4.0.5 (2021-03-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
system code page: 936

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_4.0.5 tools_4.0.5

I also tested it on Ubuntu and got the same problem. Below is the setup on Ubuntu:

R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_4.1.0 tools_4.1.0

tgirke commented 3 years ago

This is a rather unusual structure. It is definitely not a drug-like or organic compound derived from metabolites or similar. You can see this when you import it and plot the structure with plot(...) or sdf.visualize(...). So I am not sure what to suggest here.

tgirke commented 3 years ago

You could just import all of them into an SDFset and then create your own filtering rules based on atom composition, side groups, rings etc. This should be fairly easy.

T.G.

On Thu, May 20, 2021 at 9:20 AM Sukis123 wrote:

Yes, this structure is quite weird. It is from the NIST17 GC-MS library. This structure is not important to me. I am just trying to skip such compounds, but I don't know which compounds have the problem.

Best regards, Qi-Zhi Su

QizhiSu commented 3 years ago

OK, I will try. Many thanks.

Sukis

QizhiSu commented 3 years ago

Dear tgirke,

The problem is which criteria should be used to filter the SDFset. I have tried to remove compounds containing the following elements, but it doesn't work.

# Filter out compounds with metal elements

metals <- c('Fe', 'Se', 'Ge', 'Hg', 'Zn', 'Co', 'As', 'I', 'Ni', 'Mn', 'V', 'Ga', "Sn", 'B', 'In', 'Cu', 'Cd', 'Pb', 'Pd', 'Pt', 'Ti', 'Al', 'Rh')

tgirke commented 3 years ago

In ChemmineR you can get the counts of various properties from an SDFset via the functions summarized here: https://bit.ly/2QExnwF. After this, you subset the SDFset containing all your compounds based on your filter criteria. Next, you might want to save the subsetted SDFset to an intermediate SDF file with write.SDF.
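
As one possible sketch of such a filter, atomcountMA() returns a matrix of element counts per compound, which can be used to drop every compound containing an element from the list posted above. The file names combined.sdf and filtered.sdf are hypothetical, and this filter is only an assumption about what might help; nothing in this thread confirms that it removes the compounds that crash propOB().

```r
library(ChemmineR)

# Hypothetical batch file holding all imported compounds
sdfset <- read.SDFset("combined.sdf")

# Element list taken from the question above
metals <- c("Fe", "Se", "Ge", "Hg", "Zn", "Co", "As", "I", "Ni", "Mn", "V", "Ga",
            "Sn", "B", "In", "Cu", "Cd", "Pb", "Pd", "Pt", "Ti", "Al", "Rh")

# One row per compound, one column per element found in the set
counts <- atomcountMA(sdfset, addH = FALSE)

# Only the listed elements that actually occur as columns
present <- intersect(colnames(counts), metals)

# TRUE for compounds containing at least one listed element
has_listed <- rowSums(counts[, present, drop = FALSE], na.rm = TRUE) > 0

# Keep the remaining compounds and save them to an intermediate SDF file
filtered <- sdfset[unname(!has_listed)]
write.SDF(filtered, file = "filtered.sdf", cid = TRUE)
```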

T.G.

QizhiSu commented 3 years ago

I finally figured it out by reading all the *.mol files using convertFormatFile().

Thanks a lot, tgirke.

cheers Sukis
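
The thread does not show the final code, so the following is only a hypothetical sketch of how convertFormatFile() might be used for this task. It converts each .mol file to a temporary SDF file through Open Babel, appends the results to one batch file, and then imports that file. The paths are placeholders, ChemmineOB is assumed to be installed, and tryCatch() only skips files that raise an ordinary R error (it cannot recover from a crash of the R session).

```r
library(ChemmineR)  # convertFormatFile() additionally requires ChemmineOB

# Hypothetical input directory and output file
mol_paths <- list.files(path = "path/to/mol/files", pattern = "\\.mol$",
                        ignore.case = TRUE, full.names = TRUE)
combined <- "combined.sdf"
file.create(combined)  # start from an empty batch file

for (p in mol_paths) {
  tmp <- tempfile(fileext = ".sdf")
  # Convert one MOL file to SDF; the default options request 2D coordinate generation
  ok <- tryCatch({
    convertFormatFile("MOL", "SDF", p, tmp)
    TRUE
  }, error = function(e) FALSE)
  # Append the converted entry only if the conversion produced a file
  if (ok && file.exists(tmp)) file.append(combined, tmp)
}

# Import the batch file as a single SDFset
sdfset <- read.SDFset(combined)
```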