knime-mpicbg / knime-scripting

knime-scripting includes scripting extensions for KNIME to integrate R, Matlab, Python and Groovy scripts. These extensions include a collection of nodes to develop and deploy templates in the respective languages.
https://github.com/knime-mpicbg/knime-scripting/wiki
Other
43 stars 25 forks source link

Table with text columns containing and starting with missing values produce "NA" values where it should be NA #111

Open stoeter opened 1 year ago

stoeter commented 1 year ago

When a table contains text columns starting in the first rows with missing values then in R theres are correcty parsed to NA. However, when a row contains a text value, then the previous row is parsed as "NA". In consequese the table retured to KNIME contains "NA" entries where before R snippet there was a missing value. This is a bug.

Then checked on R side this happens already in the parsed kIn table (from the data provided below): kIn$Name (Sense)[1:20] [1] NA NA NA NA NA NA NA NA NA
[10] NA NA NA "NA" "seq_001" "seq_002" "seq_003" "seq_004" "seq_005" [19] "seq_006" "seq_007"

Sorting the table in KNIME in such a way that missing values are at the end of the table solves the issue for that column.

Attached KNIME workflow with data showing the problem...

Win7 KNIME 4.5.2 R version 3.6.1 (2019-07-05) Rserve 1.8-6

R snippted problem with NAs.zip

stoeter commented 1 year ago

Correction: the issue cannot be solved by pre-sorting. It looks like the last row is also parsed as "NA" instead of NA

tail(kIn) plateRow plateColumn Name (Sense) Sense sequence Name (Antisense) Reverse complement Row90#1? 8 7 Row91#1? 8 8 Row92#1? 8 9 Row93#1? 8 10 Row94#1? 8 11 Row95#1? 8 12 NA NA NA NA sample number sample number.text sample type plate number plate number.text Row90#1? NA H2O 2 002 Row91#1? NA H2O 2 002 Row92#1? NA H2O 2 002 Row93#1? NA H2O 2 002 Row94#1? NA H2O 2 002 Row95#1? NA H2O 2 002 sample number in plate plateRow384 plateColumn384 plateNumber384 Row90#1? NA 15 14 1 Row91#1? NA 15 16 1 Row92#1? NA 15 18 1 Row93#1? NA 15 20 1 Row94#1? NA 15 22 1 Row95#1? NA 15 24 1 kIn$Name (Sense)[180:192] [1] NA NA NA NA NA NA NA NA NA NA NA NA "NA"

stoeter commented 1 year ago

pasted code from command line was not properliy displayed in previous comment:

tail(kIn)
          plateRow plateColumn Name (Sense) Sense sequence Name (Antisense) Reverse complement
Row90#1_?        8           7         <NA>           <NA>             <NA>               <NA>
Row91#1_?        8           8         <NA>           <NA>             <NA>               <NA>
Row92#1_?        8           9         <NA>           <NA>             <NA>               <NA>
Row93#1_?        8          10         <NA>           <NA>             <NA>               <NA>
Row94#1_?        8          11         <NA>           <NA>             <NA>               <NA>
Row95#1_?        8          12           NA             NA               NA                 NA
          sample number sample number.text sample type plate number plate number.text
Row90#1_?            NA               <NA>         H2O            2               002
Row91#1_?            NA               <NA>         H2O            2               002
Row92#1_?            NA               <NA>         H2O            2               002
Row93#1_?            NA               <NA>         H2O            2               002
Row94#1_?            NA               <NA>         H2O            2               002
Row95#1_?            NA               <NA>         H2O            2               002
          sample number in plate plateRow384 plateColumn384 plateNumber384
Row90#1_?                     NA          15             14              1
Row91#1_?                     NA          15             16              1
Row92#1_?                     NA          15             18              1
Row93#1_?                     NA          15             20              1
Row94#1_?                     NA          15             22              1
Row95#1_?                     NA          15             24              1
> kIn$`Name (Sense)`[180:192]
 [1] NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   "NA"
stoeter commented 1 year ago

Sorry, the bug is even more severe. It looks like the last row with a value that is followed by a missing value is parsed as a missing value. In this example there are 10 sequences with their names. seq_010 was replaced by a missing value, NA, in kIn.

In addition what is very strange is that the column sampleNumber.text, which is also a text column looks fine!???

At this point is is also not clear to me what NA in R means:

: missing value / not available (ok) "NA" : text "NA" (ok) NA : appears in kIn$`Name (Sense)`[1:30] (, but not in kIn$`sample number.text`[1:30]), whereas that data displayed as kIn[1:30,] shows in table (why is it NA and not when printed as vector? does this have anything to do with the problem?) Here is the example: ``` > kIn[1:30,] plateRow plateColumn Name (Sense) Sense sequence Name (Antisense) Reverse complement Row0#0_? 1 1 Row1#0_? 1 2 Row2#0_? 1 3 Row3#0_? 1 4 Row4#0_? 1 5 Row5#0_? 1 6 Row6#0_? 1 7 Row7#0_? 1 8 Row8#0_? 1 9 Row9#0_? 1 10 Row10#0_? 1 11 Row11#0_? 1 12 Row12#0_? 2 1 NA NA NA NA Row13#0_Row0 2 2 seq_001 GAGAAAAGGAAGCTG anti-seq_001 CAGCTTCCTTTTCTC Row14#0_Row1 2 3 seq_002 GAGGAGTTGAAGCTG anti-seq_002 CAGCTTCAACTCCTC Row15#0_Row2 2 4 seq_003 GAGGAAAGGAAGCTG anti-seq_003 CAGCTTCCTTTCCTC Row16#0_Row3 2 5 seq_004 GAGATCAGAAAGCTG anti-seq_004 CAGCTTTCTGATCTC Row17#0_Row4 2 6 seq_005 GAGGGGTGGGGGCTG anti-seq_005 CAGCCCCCACCCCTC Row18#0_Row5 2 7 seq_006 GAGAAGAGGAAGCTG anti-seq_006 CAGCTTCCTCTTCTC Row19#0_Row6 2 8 seq_007 GAGAAATGGAGGCTG anti-seq_007 CAGCCTCCATTTCTC Row20#0_Row7 2 9 seq_008 GAGAAGAGGGAGCTG anti-seq_008 CAGCTCCCTCTTCTC Row21#0_Row8 2 10 seq_009 GAGAGGAGGGAGCTG anti-seq_009 CAGCTCCCTCCTCTC Row22#0_Row9 2 11 Row23#0_? 2 12 Row24#0_? 3 1 NA NA NA NA Row25#0_Row10 3 2 seq_011 GAGGGAAGGAGGCTG anti-seq_011 CAGCCTCCTTCCCTC Row26#0_Row11 3 3 seq_012 GAGGGATGGAGGCTG anti-seq_012 CAGCCTCCATCCCTC Row27#0_Row12 3 4 seq_013 GAGGGGAGGGAGCTG anti-seq_013 CAGCTCCCTCCCCTC Row28#0_Row13 3 5 seq_014 GAGGGGTGGAGGCTG anti-seq_014 CAGCCTCCACCCCTC Row29#0_Row14 3 6 seq_015 GAGGGAGCTTGGCTG anti-seq_015 CAGCCAAGCTCCCTC sample number sample number.text sample type plate number plate number.text Row0#0_? NA H2O 1 001 Row1#0_? NA H2O 1 001 Row2#0_? NA H2O 1 001 Row3#0_? NA H2O 1 001 Row4#0_? NA H2O 1 001 Row5#0_? NA H2O 1 001 Row6#0_? NA H2O 1 001 Row7#0_? NA H2O 1 001 Row8#0_? NA H2O 1 001 Row9#0_? NA H2O 1 001 Row10#0_? NA H2O 1 001 Row11#0_? NA H2O 1 001 Row12#0_? NA H2O 1 001 Row13#0_Row0 1 001 dsDNA 1 001 Row14#0_Row1 2 002 dsDNA 1 001 Row15#0_Row2 3 003 dsDNA 1 001 Row16#0_Row3 4 004 dsDNA 1 001 Row17#0_Row4 5 005 dsDNA 1 001 Row18#0_Row5 6 006 dsDNA 1 001 Row19#0_Row6 7 007 dsDNA 1 001 Row20#0_Row7 8 008 dsDNA 1 001 Row21#0_Row8 9 009 dsDNA 1 001 Row22#0_Row9 10 010 dsDNA 1 001 Row23#0_? NA H2O 1 001 Row24#0_? NA H2O 1 001 Row25#0_Row10 11 011 dsDNA 1 001 Row26#0_Row11 12 012 dsDNA 1 001 Row27#0_Row12 13 013 dsDNA 1 001 Row28#0_Row13 14 014 dsDNA 1 001 Row29#0_Row14 15 015 dsDNA 1 001 > kIn$`Name (Sense)`[1:30] [1] NA NA NA NA NA NA NA NA NA [10] NA NA NA "NA" "seq_001" "seq_002" "seq_003" "seq_004" "seq_005" [19] "seq_006" "seq_007" "seq_008" "seq_009" NA NA "NA" "seq_011" "seq_012" [28] "seq_013" "seq_014" "seq_015" > kIn$`sample number.text`[1:30] [1] 001 002 003 004 005 006 [20] 007 008 009 010 011 012 013 014 015 96 Levels: 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 ... 096 ```
stoeter commented 3 months ago

Also saw this today again in newer software versions. To me this is a severe bug, because it changes the data where it should not, and the user might not even notice! (=> changed Priority)

Win11 Pro KNIME 4.7.8 R version 4.3.2 (2023-10-31 ucrt) Rserve 1.8-13

Current example of string column parsed to R from KNIME and back to KNIME (all "NA" were previously ?): TDS-DB.ID NA 101_MSD_3_5 101_MSD_3_6 101_MSD_3_7 ? NA 101_MSD_3_9 101_MSD_3_10 101_MSD_3_11 ? NA 101_MSD_3_13 101_MSD_3_14 101_MSD_3_15 101_MSD_3_16 ? NA 101_MSD_3_18

stoeter commented 3 months ago

Ok, just realized the full problem of the bug again: a) certain missing values are parsed to R as "NA" (minor bug, annoying, but one could correct this) b) certain values (strings, rows, that are followed by a missing value) are parsed to R as NA (missing value in R) and are therefore lost! (severe bug!)