iocellnetworks / ndas4linux

The Linux driver and device management program for NDAS devices such as the NetDISK.
GNU General Public License v2.0
13 stars 12 forks source link

kernel_thread is missing #4

Open iocellnetworks opened 11 years ago

iocellnetworks commented 11 years ago

Several times error comes up in attempting to compile after exporting the build_... folder.

WARNING: "kernel_thread" [/home/david/github/ndas4linux/3.7-rc5/build_x86_64_linux/ndas-3.7.0.x86_64.dbg/ndas_sal.ko] undefined! Invoking make against the kernel at /lib/modules/3.7.0-0.rc5.git1.3.fc19.x86_64/build

grep for the location shows [fedora-tmp ndas-3.7.0.x86_64.dbg]$ grep kernel_thread ./ -rn Binary file ./sal/thread.o matches ./sal/thread.c:34:#include <linux/sched.h> // kernel_thread ./sal/thread.c:51: if ((ret = kernel_thread((int (*)(void*))f, (void *)arg, flags)) < 0) { Binary file ./ndas_sal.ko matches Binary file ./ndas_sal.o matches [fedora-tmp ndas-3.7.0.x86_64.dbg]$

So we should look to find what happened to kernel_thread in linux/sched.h

prashants commented 11 years ago

I have not yet checked it properly.

In the sched.h it say

2295 #ifdef CONFIG_GENERIC_KERNEL_THREAD 2296 extern pid_t kernel_thread(int (fn)(void ), void *arg, unsigned long flags); 2297 #endif

Maybe kthread_thread is valid only if CONFIG_GENERIC_KERNEL_THREAD is set. Can you check your kernel .config file for the value of CONFIG_GENERIC_KERNEL_THREAD ?

prashants commented 11 years ago

I checked with older kernel and CONFIG_GENERIC_KERNEL_THREAD parameter didn't exists. I guess its something new in the latest kernel.

iocellnetworks commented 11 years ago

Maybe not a new thing. After searching a little bit on the net, I find that kernel_thread is more like an old constructor. There are some patches here and there which change kernel thread to kthread_run or kthread_create. On board mentioned using one or the other.

here is one example from linux usb

http://www.mail-archive.com/linux-usb-users@lists.sourceforge.net/msg16861.html

So, I looked briefly in our platform/linux/tarball-tag/sal/thread.c and find the function. I think it is creating the process id for a thread, and then casting it into the value "tid." Probably, all we need to do is correctly change the kernel_thread into the best kthread_run or create, and this part will be ok.

The same function is called in platform/linux/xixfs/super.c a couple times too, but we are using that xixfs practically right now. It is a file system function and we are not running that file system now.

But the other place, is /ndasfat/ndft.c This may or may not be used at this time. I do see the same sort of .dll in use on the windows ndas clients.

Please let me know what you think of this assessment.

iocellnetworks commented 11 years ago

Here is a disappointing result. When grepped my Fedora Rawhide .config file it shows the proper parameter in effect.

[fedora-tmp 3.7-rc5]$ grep KERNEL_THREAD  /usr/src/kernels/3.7.0-0.rc5.git1.3.fc19.x86_64/.config
CONFIG_GENERIC_KERNEL_THREAD=y
[fedora-tmp 3.7-rc5]$ 

So, I think if it were compiling against that, the function should be available.

iocellnetworks commented 11 years ago

I put in a kthread_create() based on linux's nbd and loop existing code (and several other patches I found in the threads.) It worked. There was a kernel error though, and the system was pretty sluggish while copying data.

It might just be the numerous debug messages being written or could be that there is no genuine kthread_stop or similar being used.

This is the kernel debug output reported by Fedora:

BUG: soft lockup - CPU#1 stuck for 23s! [kwin:986]
Modules linked in: ndas_block(OF) ndas_core(POF) ndas_sal(OF) psnap p8022 llc lockd sunrpc bnep bluetooth rfkill ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 ip6table_filter ip6_tables xt_conntrack nf_conntrack snd_hda_codec_idt coretemp snd_hda_intel kvm_intel snd_hda_codec kvm snd_hwdep snd_seq snd_seq_device snd_pcm microcode ppdev snd_page_alloc iTCO_wdt snd_timer iTCO_vendor_support usblp snd lpc_ich parport_pc i2c_i801 mei parport serio_raw r8169 mfd_core mii soundcore e1000e uinput firewire_ohci(F) ata_generic(F) firewire_core(F) pata_acpi(F) crc_itu_t(F) sata_sil24(F) i915(F) pata_marvell(F) usb_storage(F) uas(F) video(F) i2c_algo_bit(F) drm_kms_helper(F) drm(F) i2c_core(F)
irq event stamp: 0
hardirqs last  enabled at (0): [<          (null)>]           (null)
hardirqs last disabled at (0): [<ffffffff8106680f>] copy_process.part.21+0x5cf/0x16e0
softirqs last  enabled at (0): [<ffffffff8106680f>] copy_process.part.21+0x5cf/0x16e0
softirqs last disabled at (0): [<          (null)>]           (null)
CPU 1 
Pid: 986, comm: kwin Tainted: PF          O 3.7.0-0.rc5.git2.1.fc19.x86_64 #1                  /DG965OT
RIP: 0010:[<ffffffff816f2224>]  [<ffffffff816f2224>] _raw_spin_unlock_irqrestore+0x44/0x80
RSP: 0018:ffff8800bde037d0  EFLAGS: 00000286
RAX: 0000000000000000 RBX: ffff880000000010 RCX: dead000000200200
RDX: 000000000000a4a9 RSI: 0000000000000001 RDI: 0000000000000286
RBP: ffff8800bde037e0 R08: ffff8800b96f1a50 R09: ffff8800b8d40c30
R10: ffff8800b96cef40 R11: 000000000000001c R12: ffff8800bde03748
R13: ffffffff816fc432 R14: ffff8800bde037e0 R15: ffff8800b96f1a00
FS:  00007fd1ca26f880(0000) GS:ffff8800bde00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7aea6aa380 CR3: 0000000090664000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kwin (pid: 986, threadinfo ffff8800b3324000, task ffff8800b2328000)
Stack:
 ffff8800b8d40b40 ffffea0002e35000 ffff8800bde038a0 ffffffff816e8207
 ffff8800bde03810 00000001001c001b 00000001001c001c ffff8800b8d40b00
 ffff8800b96f1a00 0000000000000046 ffff880000000000 0000000000000286
Call Trace:

There were some examples with the locks implemented right along with the thread creation, but I was imagining that we put the lock in place earlier, in sal/io.c, using the mutex code. So, I hope it is just caused by some delay after writing a long log.

I think 23s for a kernel hang is unacceptable.

iocellnetworks commented 11 years ago

Here is a diff -rupN output against version 3.4.6. It shows the changes I did so far.

diff -rupN 3.4.6/ndasfat/ndft.c 3.7-rc5/ndasfat/ndft.c
--- 3.4.6/ndasfat/ndft.c        2012-11-16 16:07:14.362145860 +0000
+++ 3.7-rc5/ndasfat/ndft.c      2012-11-19 03:34:13.244733391 +0000
@@ -1,4 +1,5 @@
 #include "ndasfatproc.h"
+#include <linux/kthread.h>

 NDFT   GlobalNdft = { {0} };

@@ -14,7 +15,8 @@ NdftInit (
 {
        int     error;
        pid_t   pid;
-
+               struct task_struct *tsk;
+
        NDASFAT_PRINTK( 0, ("enter\n") );

        init_completion( &GlobalNdft.Thread.Alive );
@@ -26,19 +28,21 @@ NdftInit (
        spin_lock_init( &GlobalNdft.Thread.NIQSpinLock );
        INIT_LIST_HEAD( &GlobalNdft.Thread.NIQueue );

-       pid = kernel_thread( NdftThread, &GlobalNdft, 0 );
-
-       if (pid > 0) {
-
-               error = 0;
-               GlobalNdft.Thread.Pid = pid;
-
-       } else {
+       //pid = kernel_thread( NdftThread, &GlobalNdft, 0 );
+       //      if (pid > 0) {
+       tsk = kthread_creat( NdftThread, &GlobalNdft, "ndasft" );
+       if (IS_ERR(tsk)) {

                NDAS_ASSERT( false );
                error = pid ? pid : -ECHILD;
-
                return error;
+
+       } else {
+
+               wake_up_process(tsk); 
+   error = 0;
+               GlobalNdft.Thread.Pid = pid;
+
        }

        wait_for_completion( &GlobalNdft.Thread.Alive );
diff -rupN 3.4.6/platform/linux/tarball-tag/sal/thread.c 3.7-rc5/platform/linux/tarball-tag/sal/thread.c
--- 3.4.6/platform/linux/tarball-tag/sal/thread.c       2012-11-16 16:07:14.414147059 +0000
+++ 3.7-rc5/platform/linux/tarball-tag/sal/thread.c     2012-11-19 03:44:14.621726181 +0000
@@ -32,6 +32,8 @@
 #include <linux/version.h>
 #include <linux/module.h> // EXPORT_SYMBOL
 #include <linux/sched.h> // kernel_thread
+#include <linux/kthread.h> // kthread
+
 #include <linux/errno.h> // /usr/include/errno.h , ENOMEM
 #include "linux_ver.h"
 #include "sal/types.h"
@@ -43,29 +45,25 @@
 NDAS_SAL_API
 int 
 sal_thread_create(
-    sal_thread_id* tid, char* name, int prio, int stacksize,
-    sal_thread_func f, void* arg)
+    sal_thread_id *tid, char *name, int prio, int stacksize,
+    sal_thread_func f, void *arg)
 {
-    unsigned long flags = 0L; // ignored in 2.4 - 2.6
-    pid_t ret = 0;
-    if ((ret = kernel_thread((int (*)(void*))f, (void *)arg, flags)) < 0) {
-        
-        switch (ret) {
-        case ENOMEM:
-            printk("ndas: fail to create kernel thread: out of memory\n");
-            return NDAS_ERROR_OUT_OF_MEMORY;
-        case EAGAIN:
-            // todo
-        case EINVAL:
-        default:
-            break;
+        unsigned long flags = 0L; // ignored in 2.4 - 2.6
+//    pid_t ret = 0;
+//    if ((ret = kernel_thread((int (*)(void*))f, (void *)arg, flags)) < 0) {
+        struct task_struct *tsk;
+        tsk = kthread_create((int (*)(void*))f, (void *)arg , name);
+        if (IS_ERR(tsk)) {
+                printk("ndas: sal_thread_create\n");
+                printk("ndas: failed to create kthread: reason unkown.\n");
+                printk("ndas: sal_thread_create: name=%s\n", name);
+                printk("ndas: sal_thread_create: tsk=%d\n", tsk);
+                return NDAS_ERROR_INTERNAL;
         }
-        printk("ndas: fail to create kernel thread\n");
-        return NDAS_ERROR_INTERNAL;
-    }
-    if (tid)
-        *tid = (sal_thread_id)(long)ret;
-    return NDAS_OK;
+        wake_up_process(tsk); 
+        if (tid)
+                *tid = (sal_thread_id)(long)tsk;
+        return NDAS_OK;
 }
 EXPORT_SYMBOL(sal_thread_create);
prashants commented 11 years ago

Yes its better to replace with the kthread_create. Can you check if kthread_run() will suite better, you wont have to do wake_up_process()

I could find the following threads in the code : ndiod_thread, emulisten, genpnp, con session, pnp, dpc

You can monitor the resources these kthread's are taking from the ps command. Maybe you will get some idea which one is trashing the system.

I will check the patch by tom.

prashants commented 11 years ago

It easier for me to apply the git diffs :)

Did you see the error before or its after applying the change ?

Its some locking issues related to spin locks. Check the ndft.c which has a few spin locks.

line 308 spin_lock( &GlobalNdft.Thread.NIQSpinLock );

I dont see any corresponding unlock on the same file. Can you check the locking flow in the code ?